Abstract

This paper develops a general theoretical framework to analyze structured sparse recovery problems using the notion of dual certificate. Although certain aspects of the dual certificate idea have already been used in some previous work, due to the lack of a general and coherent theory, the analysis has so far only been carried out in limited scopes for specific problems. In this context the current paper makes two contributions. First, we introduce a general definition of dual certificate, which we then use to develop a unified theory of sparse recovery analysis for convex programming. Second, we present a class of structured sparsity regularization called structured Lasso for which calculations can be readily performed under our theoretical framework. This new theory includes many seemingly loosely related previous works as special cases; it also implies new results that improve existing ones even for standard formulations such as $\ell_1$ regularization.

1 Introduction

This paper studies a general form of the sparse recovery problem, where our goal is to estimate a certain signal $\beta_*$ from observations. We are especially interested in solving this problem using convex programming; that is, given a convex set $\Omega$, our estimator $\hat\beta$ is obtained from the following regularized minimization problem:

$$\hat\beta = \arg\min_{\beta \in \Omega} \left[ L(\beta) + R(\beta) \right]. \tag{1}$$

Here $L(\beta)$ is a loss function, which measures how closely $\beta$ matches the observation; and $R(\beta)$ is a regularizer, which captures the structure of $\beta_*$. Note that the theory developed in this paper does not need to assume that $\beta_* \in \Omega$, although this is certainly a desirable property (especially if we would like to recover $\beta_*$ without error). Our primary interest is in the case where $\Omega$ lives in a Euclidean space $\bar\Omega$. However, our analysis holds automatically when $\Omega$ is contained in a separable Banach space $\bar\Omega$, and both $L(\cdot)$ and $R(\cdot)$ are convex functions that are defined in the whole space $\bar\Omega$, both inside and outside of $\Omega$.
†Research partially supported by the following grants: AFOSR-10097389, NSA-AMS 081024, NSF DMS-1007527, and NSF IIS-1016061.
As an example, assume that $\beta_*$ is a $p$-dimensional vector, $\beta_* \in \mathbb{R}^p$; we observe a vector $y \in \mathbb{R}^n$ and an $n \times p$ matrix $X$ such that

$$y = X\beta_* + \text{noise}.$$

We are interested in estimating $\beta_*$ from the noisy observation $y$. However, in modern applications we are mainly interested in the high dimensional situation where $p \gg n$. Since there are more variables than the number of observations, traditional statistical methods such as least squares regression will suffer from the so-called curse-of-dimensionality problem. To remedy the problem, it is necessary to impose structures on $\beta_*$; and a popular assumption is sparsity. That is, $\|\beta_*\|_0 = |\mathrm{supp}(\beta_*)|$ is smaller than $n$, where $\mathrm{supp}(\beta) = \{j : \beta_j \neq 0\}$. A direct formulation of the sparsity constraint leads to the nonconvex $\ell_0$ regularization formulation, which is difficult to solve. A frequent remedy is to employ the so-called convex relaxation approach, where the $\ell_0$ regularization is replaced by an $\ell_1$ regularizer $R(\beta) = \lambda\|\beta\|_1$ that is convex. If we further consider the least squares loss $L(\beta) = \|y - X\beta\|_2^2$, then we obtain the following $\ell_1$ regularization method (Lasso)

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \left[ \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \right], \tag{2}$$

where $\Omega$ is chosen to be the whole parameter space $\Omega = \mathbb{R}^p$.
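To make the Lasso formulation concrete, the following is a minimal sketch of cyclic coordinate descent for the objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$. Coordinate descent is a standard solver that we swap in for illustration; it is not a method proposed in this paper, and the function names (`soft_threshold`, `lasso_objective`, `lasso_coordinate_descent`) are hypothetical.

```python
def soft_threshold(c, tau):
    """Shrink c toward zero by tau (proximal map of tau * |.|)."""
    if c > tau:
        return c - tau
    if c < -tau:
        return c + tau
    return 0.0


def lasso_objective(X, y, beta, lam):
    """Value of ||y - X beta||_2^2 + lam * ||beta||_1 (no 1/2 factor)."""
    resid = [yi - sum(xij * bj for xij, bj in zip(row, beta))
             for row, yi in zip(X, y)]
    return sum(r * r for r in resid) + lam * sum(abs(b) for b in beta)


def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Illustrative solver: cyclic coordinate descent on the Lasso objective."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes coordinate j.
            c = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n))
            # Minimizer of col_sq*b^2 - 2*c*b + lam*|b| over the scalar b.
            new = soft_threshold(c, lam / 2.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

With an identity design, `lasso_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], 1.0)` shrinks the large coefficient by $\lambda/2$ and sets the small one exactly to zero, which is the sparsity-inducing behavior that motivates the $\ell_1$ relaxation.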

2 Related Work 

In sparse recovery analysis, we want to know how good our estimator $\hat\beta$ is in comparison to the target $\beta_*$. For the standard $\ell_1$ regularization method (2), two types of theoretical questions are of interest. The first is support recovery; that is, whether $\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta_*)$. The second is parameter estimation; that is, how small is $\|\hat\beta - \beta_*\|_2$. The support recovery problem is often studied under the so-called irrepresentable condition (sometimes also referred to more generally as a coherence condition) [18] [24] [31] [26], while the parameter estimation problem is often studied under the so-called restricted isometry property (or RIP) as well as its generalizations [8] [29] [2] [30] [25] [27]. Related ideas have been extended to more complex structured sparse regularization problems such as group sparsity [13] [17] and certain matrix problems [16] [20] [15]. Closely related to parameter estimation is the so-called oracle inequality, which is particularly suitable for the dual-certificate analysis considered here.

This paper is interested in the second question of parameter estimation, and the related problem of sparse oracle inequality. Our goal is to present a general theoretical framework using the notion of dual certificate to analyze sparse regularization problems such as the standard Lasso (2) as well as its generalization to more complex structured sparsity problems in (1). We note that there were already some recent attempts at developing such a general theory, such as [19] and [10], but both have limitations. In particular the technique of [10] only applies to noise-less regression problems with Gaussian random design (its main contribution is the nice observation that Gordon's minimum singular value result can be applied to structured sparse recovery problems; the consequences will be further investigated in our paper); results in [10] are subsumed by our more general results given in Section 4.2. The analysis in [19] relied on a direct generalization of RIP for decomposable regularizers, which has technical limitations in its applications to more complex structured problems such as matrix regularization: the technique of RIP-like analysis and its generalizations such as [16] [20] gives performance bounds that do not imply exact recovery even when the noise is zero, while the technique we investigate here (via the notion of dual certificate) can get exact recovery [22]. In addition, not all regularizers can be easily considered as decomposable (for example, the mixed norm example in Section 6.3 is not). Even for Gaussian random design, the complexity statement in Section 4.2 relies only on a Gaussian width calculation that is more general than decomposable. Therefore our analysis in this paper extends that of [19] in multiple ways.

While the notion of dual certificate has been successfully employed in some earlier work (especially for some matrix regularization problems) such as [22] [4] [12], these results focused on special problems without a general theory. In fact, from earlier work it is not even clear what should be a general definition of dual certificate for the structured sparsity formulation (1). This paper addresses this issue. Specifically we will provide a general definition of dual certificate for the regularized estimation problem (1) and demonstrate that this definition can be used to develop a theoretical framework to analyze the sparse recovery performance of $\hat\beta$ with noise. Not only does it provide a direct generalization of earlier work such as [22] [4] [12], but it also unifies RIP type analysis (or its generalization to restricted strong convexity) such as [H [19] and irrepresentable (or incoherence) conditions such as [31] [26]. In this regard the general theory also includes as special cases some recent work by Candes and Plan that tried to develop non-RIP analysis for $\ell_1$ regularization [5] [6]. In fact, even for the simple case of $\ell_1$ regularization, we show that our theory can lead to new and sharper results than existing ones.

Finally, we would like to point out that while this paper successfully unifies the irrepresentable (or incoherence) conditions and RIP conditions under the general method of dual certificate, our analysis does not subsume some of the more elaborate analyses such as [30] and [27] as special cases. Those studies employed a different generalization of RIP, which we may refer to as the invertibility factor approach using the terminology of [27]. It thus remains open whether it is possible to develop an even more general theory that can include all previous sparse recovery analyses as special cases.

3 Primal-Dual Certificate 

As mentioned before, while fragments of the dual certificate idea have appeared before, there has so far been no general definition and theory. Therefore in this section we will introduce a formal definition that can be used to analyze (1). Recall that the parameter space $\Omega$ lives in a separable Banach space $\bar\Omega$. Let $\bar\Omega^*$ be the dual Banach space of $\bar\Omega$ containing all continuous linear functions $u(\beta)$ defined on $\bar\Omega$. We use $\langle u, \beta\rangle = u(\beta)$ to denote the bi-linear function defined on $\bar\Omega^* \times \bar\Omega$. If $\bar\Omega$ is a Euclidean space, then $\langle\cdot,\cdot\rangle$ is just an inner product. In this notation $\langle\cdot,\cdot\rangle$, the first argument is always in the dual space $\bar\Omega^*$ and the second in the primal space $\bar\Omega$. This allows us to keep track of the geometrical interpretation of our analysis even when $\bar\Omega$ is a Euclidean or Hilbert space with $\bar\Omega^* = \bar\Omega$. In what follows, we will endow $\bar\Omega^*$ with the weak topology: $u_k \to u$ iff $\langle u_k - u, \beta\rangle \to 0$ for all $\beta \in \bar\Omega$. This is equivalent to $\|u_k - u\|_D \to 0$ for any norm $\|\cdot\|_D$ in $\bar\Omega^*$ when $\bar\Omega$ is a Euclidean space.

In the following, given any convex function $\phi(\cdot)$, we use the notation $\nabla\phi(\beta) \in \bar\Omega^*$ to denote a subgradient of $\phi(\beta)$ with respect to the geometry of $\bar\Omega$ in the following sense:

$$\phi(\beta') \geq \phi(\beta) + \langle\nabla\phi(\beta), \beta' - \beta\rangle \quad \text{for all } \beta' \in \bar\Omega.$$

By convention, we also use $\partial\phi(\beta)$ to denote its sub-differential (or the set of subgradients at $\beta$). The sub-differential is always a closed convex set in $\bar\Omega^*$. Moreover, we define the Bregman divergence with respect to $\phi$ as:

$$D_\phi(\beta, \beta') = \phi(\beta) - \phi(\beta') - \langle\nabla\phi(\beta'), \beta - \beta'\rangle.$$

Clearly, by the definition of subgradient, Bregman divergence is non-negative. These quantities are standard in convex analysis; for example, additional details can be found in [23].

Instead of working directly with the target $\beta_*$, we consider an approximation $\bar\beta \in \Omega$ of $\beta_*$, which may have certain nice properties that will become clear later on. Nevertheless, for the purpose of understanding the main idea, it may be convenient to simply assume that $\bar\beta = \beta_*$ (thus $\beta_* \in \Omega$) during the first reading.

Given any $\bar\beta \in \Omega$ and subset $G \subset \partial R(\bar\beta)$, we define a modified regularizer

$$R_G(\beta) = R(\bar\beta) + \sup_{v \in G} \langle v, \beta - \bar\beta\rangle.$$

It is clear that $R_G(\beta) \leq R(\beta)$ for all $\beta$ and $R(\bar\beta) = R_G(\bar\beta)$. The value of $R_G(\beta)$ is unchanged if $G$ is replaced by the closure of its convex hull. Moreover, if $G$ is convex and closed, then the sub-differential of $R_G(\beta)$ is identical to $G$ at $\bar\beta$ and contained in $G$ elsewhere. In fact, by checking the condition $R_G(b) - R_G(\beta) \geq \langle v, b - \beta\rangle$ for $b = t\beta$ and $b = \bar\beta$, we see that for closed convex $G$

$$\partial R_G(\beta) = \{v \in G : R_G(\beta) = \langle v, \beta\rangle = R(\bar\beta) + \langle v, \beta - \bar\beta\rangle\}.$$

In what follows, we pick a closed convex $G$ unless otherwise stated.
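As an illustration of the modified regularizer, consider the assumed special case $R(\beta) = \lambda\|\beta\|_1$ with $G$ taken to be the whole sub-differential $\partial R(\bar\beta)$. Then each $v \in G$ has $v_j = \lambda\,\mathrm{sign}(\bar\beta_j)$ on the support of $\bar\beta$ and $|v_j| \leq \lambda$ elsewhere, so the supremum separates by coordinate. The sketch below evaluates $R_G$ under these assumptions; the helper names are hypothetical.

```python
def sign(x):
    """Sign of a real number as -1, 0, or 1."""
    return (x > 0) - (x < 0)


def l1_penalty(beta, lam):
    """R(beta) = lam * ||beta||_1."""
    return lam * sum(abs(b) for b in beta)


def modified_l1_penalty(beta, beta_bar, lam):
    """R_G(beta) for R = lam*||.||_1 and G = full subdifferential of R at beta_bar."""
    total = l1_penalty(beta_bar, lam)
    for b, bb in zip(beta, beta_bar):
        if bb != 0:
            # On the support of beta_bar the subgradient is forced to lam*sign(bb).
            total += lam * sign(bb) * (b - bb)
        else:
            # Off the support, sup over |v_j| <= lam of v_j * b equals lam*|b|.
            total += lam * abs(b)
    return total
```

In this special case the gap $R(\beta) - R_G(\beta)$ equals $\lambda \sum_{j \in \mathrm{supp}(\bar\beta)} (|\beta_j| - \mathrm{sign}(\bar\beta_j)\beta_j)$, which is non-negative and vanishes at $\beta = \bar\beta$, in line with the two properties stated above.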

In optimization, $\beta$ is generally referred to as the primal variable and $\nabla L(\beta)$ as the corresponding dual variable, since they live in $\bar\Omega$ and $\bar\Omega^*$ respectively. An optimal solution $\hat\beta$ of (1) satisfies the KKT condition when its dual satisfies the relationship $-\nabla L(\hat\beta) \in \partial R(\hat\beta)$. However, for the general formulation (1), this condition can be rather hard to work with. Therefore in order to analyze (1), we introduce the notion of primal-dual certificate, which is a primal variable $Q_G$ satisfying a simplified dual constraint $-\nabla L(Q_G) \in \partial R(\bar\beta)$. To be consistent with some earlier literature, one may refer to the quantity $-\nabla L(Q_G)$ as the corresponding dual certificate. For notational simplicity, without causing confusion, in this paper we will also refer to $Q_G$ as a dual certificate.

3.1 Primal Dual Certificate Sparse Recovery Bound 

The formal definition of dual certificate is given in Definition 3.1. In this definition, we also allow an approximate dual certificate, which may have a small violation of the dual constraint; such an approximation can be convenient for some applications.

Definition 3.1 (Primal-Dual Certificate) Given any $\bar\beta \in \Omega$ and a closed convex subset $G \subset \partial R(\bar\beta)$. A $\delta$-approximate primal-dual (or simply dual) certificate $Q_G$ (with respect to $G$) of (1) is a primal variable that satisfies the following condition:

$$-\nabla L(Q_G) + \delta \in G. \tag{3}$$

If $\delta = 0$, we call $Q_G$ an exact primal-dual certificate or simply a dual certificate.

We may choose a convex function $\bar L(\beta)$ that is close to $L(\beta)$ and use it to construct an approximate dual certificate with

$$Q_G = \arg\min_{\beta} \left\{ \bar L(\beta) + R_G(\beta) \right\}. \tag{4}$$

Since $-\nabla\bar L(Q_G) \in \partial R_G(Q_G) \subset G$, (3) holds for $\delta = \nabla L(Q_G) - \nabla\bar L(Q_G)$. However, this choice may not always lead to the best result in the analysis of the estimator (1), especially when $-\nabla L(Q_G) + \delta = -\nabla\bar L(Q_G)$ is an interior point of $G$. Possible choices of $\bar L(\beta)$ include $\gamma L(\beta)$ with a constant $\gamma$, its expectation, and their approximations. Note that we do not assume that $Q_G \in \Omega$. In order to approximately enforce such a constraint, we may replace $\bar L(\beta)$ by $\bar L(\beta) + L_\Delta(\beta)$ for any convex function $L_\Delta(\beta) \geq 0$ such that $L_\Delta(\beta) = 0$ when $\beta \in \Omega$. If $L_\Delta(\beta)$ is sufficiently large, then we can construct a $Q_G$ that is approximately contained in $\Omega$. More detailed dual certificate construction techniques are discussed in Section 4.

An essential result that relates a primal-dual certificate $Q_G$ to $\hat\beta$ is stated in the following fundamental theorem, which says that if $Q_G$ is close to $\bar\beta$, then $\hat\beta$ is close to $\bar\beta$ (when $\delta = 0$). In order to apply this theorem, we shall choose $\bar\beta \approx \beta_*$.

Theorem 3.1 (Primal-Dual Certificate Sparse Recovery Bound) Given an approximate primal-dual certificate $Q_G$ in Definition 3.1, we have the following inequality:

$$D_L(\bar\beta, \hat\beta) + D_L(\hat\beta, Q_G) + [R(\hat\beta) - R_G(\hat\beta)] \leq D_L(\bar\beta, Q_G) - \langle\delta, \hat\beta - \bar\beta\rangle.$$

The proof is a simple application of the following two propositions. 

Proposition 3.1 For any convex function $L(\cdot)$, the following identity holds for Bregman divergence:

$$D_L(a, b) + D_L(b, c) - D_L(a, c) = \langle\nabla L(c) - \nabla L(b), a - b\rangle.$$

Proof This can be easily verified using simple algebra. We can expand the left hand side as follows.

$$\begin{aligned}
& D_L(a, b) + D_L(b, c) - D_L(a, c) \\
&= [L(a) - L(b) - \langle\nabla L(b), a - b\rangle] + [L(b) - L(c) - \langle\nabla L(c), b - c\rangle] - [L(a) - L(c) - \langle\nabla L(c), a - c\rangle] \\
&= -\langle\nabla L(b), a - b\rangle - \langle\nabla L(c), b - c\rangle + \langle\nabla L(c), a - c\rangle.
\end{aligned}$$

This can be simplified to obtain the right hand side. ■
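The three-point identity of Proposition 3.1 is easy to check numerically. The sketch below does so for one assumed smooth convex function, $L(b) = \sum_i [\ln(1 + e^{b_i}) + b_i^2/2]$, chosen only for illustration; the function names are hypothetical.

```python
import math


def loss(b):
    """An illustrative smooth convex function on R^p (an assumption, not from the text)."""
    return sum(math.log1p(math.exp(t)) + 0.5 * t * t for t in b)


def grad(b):
    """Gradient of loss: sigmoid(t) + t in each coordinate."""
    return [1.0 / (1.0 + math.exp(-t)) + t for t in b]


def inner(u, v):
    """Euclidean inner product <u, v>."""
    return sum(x * y for x, y in zip(u, v))


def bregman(a, b):
    """D_L(a, b) = L(a) - L(b) - <grad L(b), a - b>."""
    diff = [x - y for x, y in zip(a, b)]
    return loss(a) - loss(b) - inner(grad(b), diff)


def three_point_gap(a, b, c):
    """Left side minus right side of the identity; should be zero up to rounding."""
    lhs = bregman(a, b) + bregman(b, c) - bregman(a, c)
    grad_diff = [gc - gb for gc, gb in zip(grad(c), grad(b))]
    rhs = inner(grad_diff, [x - y for x, y in zip(a, b)])
    return lhs - rhs
```

Calling `three_point_gap` on arbitrary vectors of equal length returns a value at the level of floating-point rounding, and `bregman` is non-negative because the chosen function is convex.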

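Proposition 3.2 Let $\tilde\beta = t\hat\beta + (1 - t)\bar\beta$ for some $t \in [0, 1]$. Then, given any $v \in G$, we have

$$\langle -v - \nabla L(\tilde\beta), \bar\beta - \tilde\beta\rangle \leq R_G(\tilde\beta) - R(\tilde\beta).$$

Proof The definition of $\hat\beta$ and the convexity of (1) imply that $\tilde\beta$ achieves the minimum objective value $L(\beta) + R(\beta)$ for $\beta$ that lies in the line segment between $\bar\beta$ and $\tilde\beta$. This is equivalent to $\langle\nabla L(\tilde\beta) + \nabla R(\tilde\beta), \bar\beta - \tilde\beta\rangle \geq 0$. Since $R(\cdot)$ is convex, this implies $\langle\nabla L(\tilde\beta), \bar\beta - \tilde\beta\rangle + R(\bar\beta) \geq R(\tilde\beta)$. Thus,

$$\langle -v - \nabla L(\tilde\beta), \bar\beta - \tilde\beta\rangle \leq \langle v, \tilde\beta - \bar\beta\rangle + R(\bar\beta) - R(\tilde\beta) \leq R_G(\tilde\beta) - R(\tilde\beta)$$

by the definition of $R_G(\tilde\beta)$. ■

Proof of Theorem 3.1 We apply Proposition 3.1 with $a = \bar\beta$, $b = \hat\beta$, and $c = Q_G$ to obtain:

$$D_L(\bar\beta, \hat\beta) + D_L(\hat\beta, Q_G) - D_L(\bar\beta, Q_G) = \langle\nabla L(Q_G) - \nabla L(\hat\beta), \bar\beta - \hat\beta\rangle = \langle -v + \delta - \nabla L(\hat\beta), \bar\beta - \hat\beta\rangle,$$

where $v \in G$. We can now apply Proposition 3.2 with $t = 1$ to obtain the desired bound. ■

The result shows that if we have a good bound on $D_L(\bar\beta, Q_G)$, then it is possible to obtain a bound on $D_L(\bar\beta, \hat\beta)$. In general, we also choose $G$ so that the difference $R(\hat\beta) - R_G(\hat\beta)$ can effectively control the magnitude of $\hat\beta$ outside of the support (or a tangent space) of $\bar\beta$.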
3.2 Primal Dual Certificate Sparse Oracle Inequality

It is also possible to derive a stronger form of oracle inequality for special $L$ with a more refined definition of dual certificate.

Definition 3.2 (Generalized Primal-Dual Certificate) Given $\bar\beta \in \Omega$, a closed convex set $G \subset \partial R(\bar\beta)$, a convex function $\bar L$ on $\bar\Omega$, and an additional parameter $\bar\beta_* \in \bar\Omega$. A generalized $\delta$-approximate primal-dual (or simply dual) certificate $Q_G$ with respect to $(L, \bar L, \bar\beta, \bar\beta_*)$ is a primal variable that satisfies the following condition:

$$-\nabla\bar L_*(Q_G) + \delta \in G, \tag{5}$$

where $\bar L_*(\beta) = \bar L(\beta) - \langle\nabla\bar L(\bar\beta) - \nabla L(\bar\beta_*), \beta - \bar\beta\rangle$.

Note that if $\langle\cdot,\cdot\rangle$ is an inner product and $L$ is a quadratic function of the form

$$L(\beta) = \langle H\beta - z, \beta\rangle \tag{6}$$

for some self-adjoint operator $H$ and vector $z$, then $D_L(\beta, \beta') = \langle H(\beta - \beta'), \beta - \beta'\rangle$. In this case, we may simply take $\bar L(\cdot) = L(\cdot)$. For other cost functions, it will be useful to take $\bar L(\cdot) = \gamma L(\cdot)$ with $\gamma < 1$. The reason will become clear later on.

Definition 3.2 is equivalent to Definition 3.1 with $L(\beta)$ replaced by a redefined convex function $\bar L_*(\beta) = \bar L(\beta) - \langle\nabla\bar L(\bar\beta) - \nabla L(\bar\beta_*), \beta - \bar\beta\rangle$. We may consider $\bar\beta_*$ to be the true target $\beta_*$ (or its approximation) in that we can assume that $\nabla L(\bar\beta_*)$ is small although $\bar\beta_*$ may not be sparse. The main advantage of Definition 3.2 is that it allows comparison to an arbitrary sparse approximation $\bar\beta$ to $\bar\beta_*$ even when $\nabla L(\bar\beta)$ is not small; the definition only requires $\nabla\bar L_*(\bar\beta) = \nabla L(\bar\beta_*)$ to be small. This implies that $\bar\beta$ may have a dual certificate $Q_G$ with respect to $\bar L_*(\cdot)$ that is close to $\bar\beta$ (see error bounds in Section 4). The following result shows that one can obtain an oracle inequality that generalizes Theorem 3.1. In order to apply this theorem, we should choose $\bar\beta_* \approx \beta_*$.

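Theorem 3.2 (Primal-Dual Certificate Sparse Oracle Inequality) Given a generalized $\delta$-approximate primal-dual certificate $Q_G$ in Definition 3.2, we have for all $\tilde\beta$ in the line segment between $\bar\beta$ and $\hat\beta$:

$$D_L(\bar\beta, \tilde\beta) + D_L(\tilde\beta, \bar\beta_*) + D_{\bar L}(\tilde\beta, Q_G) + [R(\tilde\beta) - R_G(\tilde\beta)] \leq D_{\bar L}(\tilde\beta, \bar\beta) + D_L(\bar\beta, \bar\beta_*) + D_{\bar L}(\bar\beta, Q_G) - \langle\delta, \tilde\beta - \bar\beta\rangle.$$

Proof We apply Proposition 3.1 with $a = \bar\beta$, $b = \tilde\beta$, and $c = \bar\beta_*$ to obtain:

$$D_L(\bar\beta, \tilde\beta) + D_L(\tilde\beta, \bar\beta_*) - D_L(\bar\beta, \bar\beta_*) = \langle\nabla L(\bar\beta_*) - \nabla L(\tilde\beta), \bar\beta - \tilde\beta\rangle.$$

Similarly, we can apply Proposition 3.1 with $a = \tilde\beta$, $b = \bar\beta$, and $c = Q_G$ to $\bar L$ to obtain:

$$D_{\bar L}(\tilde\beta, \bar\beta) + D_{\bar L}(\bar\beta, Q_G) - D_{\bar L}(\tilde\beta, Q_G) = \langle\nabla\bar L(Q_G) - \nabla\bar L(\bar\beta), \tilde\beta - \bar\beta\rangle.$$

By subtracting the above two displayed equations, we obtain

$$\begin{aligned}
& D_L(\bar\beta, \tilde\beta) + D_L(\tilde\beta, \bar\beta_*) - D_L(\bar\beta, \bar\beta_*) - D_{\bar L}(\tilde\beta, \bar\beta) - D_{\bar L}(\bar\beta, Q_G) + D_{\bar L}(\tilde\beta, Q_G) \\
&= \langle\nabla L(\bar\beta_*) - \nabla L(\tilde\beta) + \nabla\bar L(Q_G) - \nabla\bar L(\bar\beta), \bar\beta - \tilde\beta\rangle.
\end{aligned}$$

Since $\nabla\bar L(Q_G) + \nabla L(\bar\beta_*) - \nabla\bar L(\bar\beta) = -v + \delta$ for some $v \in G$, the right hand side can be written as $\langle -v + \delta - \nabla L(\tilde\beta), \bar\beta - \tilde\beta\rangle$. The conclusion then follows from Proposition 3.2. ■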



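Note that if we choose $\bar L = L$ and $\bar\beta_* = \bar\beta$ in Theorem 3.2, then Definition 3.2 is consistent with Definition 3.1, and Theorem 3.2 becomes Theorem 3.1. Since $\bar L_*(\beta) - \bar L(\beta)$ is linear in $\beta$, $D_{\bar L}(\beta, Q_G) = D_{\bar L_*}(\beta, Q_G)$. Moreover, when $\nabla L(\bar\beta_*)$ is small, $\nabla\bar L_*(\bar\beta)$ is small by the choice of $\bar L_*(\cdot)$ in Definition 3.2, so that $D_{\bar L_*}(\bar\beta, Q_G)$ is small when $\bar L_*$ has sufficient convexity near $\bar\beta$. This motivates a choice of $\bar L(\cdot)$ satisfying $D_{\bar L}(\beta, \bar\beta) \leq D_L(\bar\beta, \beta)$ for all $\beta \in \Omega$ whenever such a choice is available and reasonably convex near $\bar\beta$. This leads to the following corollary.

Corollary 3.1 Given a generalized exact primal-dual certificate $Q_G$ in Definition 3.2 with $\bar L(\cdot)$ satisfying $D_L(\bar\beta, \beta) \geq D_{\bar L}(\beta, \bar\beta)$ for all $\beta \in \Omega$. Then,

$$D_L(\hat\beta, \bar\beta_*) + [R(\hat\beta) - R_G(\hat\beta)] \leq D_L(\bar\beta, \bar\beta_*) + D_{\bar L_*}(\bar\beta, Q_G).$$

In some problems, Corollary 3.1 is applicable with $\bar L(\cdot) = \gamma L(\cdot)$ for some $\gamma \in (0, 1]$. In the special case that $L(\cdot)$ is a quadratic function as in (6), we have $D_L(\bar\beta, \beta) = D_L(\beta, \bar\beta)$. Therefore we may take $\gamma = 1$, and the bound in Corollary 3.1 can be further simplified to

$$D_L(\hat\beta, \bar\beta_*) + [R(\hat\beta) - R_G(\hat\beta)] \leq D_L(\bar\beta, \bar\beta_*) + D_L(\bar\beta, Q_G).$$

If $L(\cdot)$ comes from a generalized linear model of the form $L(\beta) = \sum_{i=1}^n \ell_i(\langle x_i, \beta\rangle)$, with $x_i \in \bar\Omega^*$ and second order differentiable convex scalar functions $\ell_i$, then the condition $D_L(\bar\beta, \beta) \geq \gamma D_L(\beta, \bar\beta)$ is satisfied as long as:

$$\inf_{\beta \in \Omega} \frac{D_L(\bar\beta, \beta)}{D_L(\beta, \bar\beta)} \geq \inf_{\{\beta', \beta''\} \subset \Omega} \frac{\sum_{i=1}^n \ell_i''(\langle x_i, \beta'\rangle)\langle x_i, \beta - \bar\beta\rangle^2}{\sum_{i=1}^n \ell_i''(\langle x_i, \beta''\rangle)\langle x_i, \beta - \bar\beta\rangle^2} \geq \inf_{i \in \{1, \ldots, n\}} \inf_{\{\beta', \beta''\} \subset \Omega} \frac{\ell_i''(\langle x_i, \beta'\rangle)}{\ell_i''(\langle x_i, \beta''\rangle)} \geq \gamma.$$

This means that the condition of Corollary 3.1 holds as long as for all $i$ and $\beta, \beta' \in \Omega$: $\ell_i''(\langle x_i, \beta\rangle) \geq \gamma\,\ell_i''(\langle x_i, \beta'\rangle)$. For example, for logistic regression $\ell_i(t) = \ln(1 + \exp(-t))$ with $\sup_i \sup_{\beta \in \Omega} |\langle x_i, \beta\rangle| \leq A$, we can pick $\gamma = 4/(2 + \exp(-A) + \exp(A))$. This choice of $\gamma$ can be improved if we have additional constraints on $\hat\beta$; an example is given in Corollary 4.2. In Section 6.4, we will present a more concrete and elaborate analysis for generalized linear models.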
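The logistic-regression constant can be checked directly. For $\ell(t) = \ln(1 + \exp(-t))$ the second derivative is $\ell''(t) = 1/(2 + e^{t} + e^{-t})$, which is largest at $t = 0$ and smallest at $|t| = A$, so the worst-case curvature ratio over $[-A, A]$ is $4/(2 + e^{-A} + e^{A})$. The sketch below compares this closed form with a grid search; the function names and the grid size are illustrative assumptions.

```python
import math


def logistic_curvature(t):
    """Second derivative of l(t) = ln(1 + exp(-t))."""
    return 1.0 / (2.0 + math.exp(t) + math.exp(-t))


def gamma_for_bound(A):
    """Closed-form constant 4 / (2 + exp(-A) + exp(A))."""
    return 4.0 / (2.0 + math.exp(-A) + math.exp(A))


def worst_curvature_ratio(A, grid=401):
    """min/max of the curvature over an odd-sized grid on [-A, A] (includes 0 and +-A)."""
    ts = [-A + 2.0 * A * k / (grid - 1) for k in range(grid)]
    curv = [logistic_curvature(t) for t in ts]
    return min(curv) / max(curv)
```

The constant decays roughly like $4e^{-A}$ for large $A$, which is why a tighter constraint on the range of $\langle x_i, \beta\rangle$ gives a better $\gamma$.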

Note that the result of Corollary 3.1 gives an oracle inequality that compares $D_L(\hat\beta, \bar\beta_*)$ to $D_L(\bar\beta, \bar\beta_*)$ with leading coefficient one. The bound is meaningful as long as $\bar\beta$ has a good dual certificate $Q_G$ under $\bar L_*(\beta)$ that is close to $\bar\beta$. The possibility of obtaining oracle inequalities of this kind with leading coefficient one was first noticed in [16] under restricted strong convexity. The advantage of such an oracle inequality is that we do not require $\bar\beta_*$ to be sparse, but rather the competitor $\bar\beta$ to be sparse, which implies that the dual certificate $Q_G$ is close to $\bar\beta$ when $\bar L_*(\beta)$ is sufficiently convex. Here we generalize the result of [16] in two ways. First, it is possible to deal with non-quadratic loss. Second, we only require the existence of a good dual certificate $Q_G$, which is a weaker requirement than restricted strong convexity in [16].

Generally speaking, the dual certificate technique allows us to bound the oracle inequality quantity $D_L(\hat\beta, \bar\beta_*) + [R(\hat\beta) - R_G(\hat\beta)]$ directly. If we are interested in other results such as the parameter estimation bound $\|\hat\beta - \beta_*\|$, then additional estimates will be needed on top of the dual certificate theory of this paper. Instead of working out general results, we will study this problem for the structured $\ell_1$ regularizer in Section 5.
4 Constructing Primal-Dual Certificate

We will present some general results for estimating $D_L(\bar\beta, Q_G)$ under various assumptions. For notational simplicity, the main technical derivation considers Definition 3.1 with dual certificate $Q_G$ with respect to $L(\beta)$. One can then apply these results to the dual certificate $Q_G$ in Definition 3.2.

4.1 Global Restricted Strong Convexity 

We first consider the following construction of primal-dual certificate.

Proposition 4.1 Let

$$Q_G = \arg\min_{\beta} \left[ L(\beta) + R_G(\beta) \right], \tag{7}$$

then $Q_G$ is an exact primal-dual certificate of (1).

Proof It is clear from the optimality condition of (7) that $\nabla L(Q_G) + v = 0$ for some $v \in G$. ■

The symmetrized Bregman divergence is defined as

$$D^s_L(\beta, \bar\beta) = D_L(\beta, \bar\beta) + D_L(\bar\beta, \beta) = \langle\nabla L(\beta) - \nabla L(\bar\beta), \beta - \bar\beta\rangle.$$

We introduce the concept of restricted strong convexity to bound $D^s_L(\bar\beta, Q_G)$.

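Definition 4.1 (Restricted Strong Convexity) We define the following quantity, which we refer to as the global restricted strong convexity (RSC) constant:

$$\gamma_L(\bar\beta; r, G, \|\cdot\|) = \inf\left\{ \frac{D^s_L(\beta, \bar\beta)}{\|\beta - \bar\beta\|^2} : 0 < \|\beta - \bar\beta\| \leq r;\ D^s_L(\beta, \bar\beta) + \sup_{u \in G} \langle u + \nabla L(\bar\beta), \beta - \bar\beta\rangle \leq 0 \right\},$$

where $\|\cdot\|$ is a norm in $\bar\Omega$, $r > 0$ and $G \subset \partial R(\bar\beta)$.

The parameter $r$ is introduced for localized analysis, where the Hessian may be small when $\|\beta - \bar\beta\| > r$. For least squares loss, which has a constant Hessian, one can just pick $r = \infty$.

We recall the concept of dual norm in $\bar\Omega^*$: $\|\cdot\|_D$ is the dual norm of $\|\cdot\|$ if

$$\|u\|_D = \sup_{\|\beta\| = 1} \langle u, \beta\rangle.$$

It implies the inequality $\langle u, \beta\rangle \leq \|u\|_D \|\beta\|$.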
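As a concrete instance of the dual norm, take $\|\cdot\| = \|\cdot\|_1$ on $\mathbb{R}^p$, whose dual norm is $\|\cdot\|_\infty$. The supremum of the linear function $\langle u, \beta\rangle$ over the $\ell_1$ unit sphere is attained at a signed coordinate vector, so it can be found by enumeration. The sketch below, with hypothetical helper names, checks this and the resulting inequality $\langle u, \beta\rangle \leq \|u\|_\infty \|\beta\|_1$.

```python
def inner(u, b):
    """Euclidean pairing <u, b>."""
    return sum(x * y for x, y in zip(u, b))


def l1_norm(b):
    return sum(abs(x) for x in b)


def linf_norm(u):
    return max(abs(x) for x in u)


def dual_of_l1_by_search(u):
    """sup of <u, b> over ||b||_1 = 1, searched over signed coordinate vectors."""
    best = float("-inf")
    for j in range(len(u)):
        for s in (1.0, -1.0):
            b = [0.0] * len(u)
            b[j] = s  # an extreme point of the l1 unit ball
            best = max(best, inner(u, b))
    return best


def holder_holds(u, b, tol=1e-12):
    """Check <u, b> <= ||u||_inf * ||b||_1 up to a rounding tolerance."""
    return inner(u, b) <= linf_norm(u) * l1_norm(b) + tol
```

The enumeration is enough here because a linear function on a polytope attains its maximum at a vertex; for other norms the supremum would need a different computation.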

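Theorem 4.1 (Dual Certificate Error Bound under RSC) Let $\|\cdot\|$ be a norm in $\bar\Omega$ and $\|\cdot\|_D$ its dual norm in $\bar\Omega^*$. Consider $\bar\beta \in \Omega$, and a closed convex $G \subset \partial R(\bar\beta)$. Let $\Delta_r = \gamma_L(\bar\beta; r, G, \|\cdot\|)^{-1} \inf_{u \in G} \|u + \nabla L(\bar\beta)\|_D$. If $\Delta_r < r$ for some $r > 0$, then for any $Q_G$ given by (7),

$$D^s_L(\bar\beta, Q_G) \leq \gamma_L(\bar\beta; r, G, \|\cdot\|)\,\Delta_r^2, \qquad \|\bar\beta - Q_G\| \leq \Delta_r.$$

Proof By the optimality condition (7) of $Q_G$, there exists $v \in \partial R_G(Q_G)$ such that $\nabla L(Q_G) + v = 0$. For $v \in \partial R_G(Q_G)$, $R_G(Q_G) - R(\bar\beta) = \langle v, Q_G - \bar\beta\rangle \geq \sup_{u \in G} \langle u, Q_G - \bar\beta\rangle$. Therefore,

$$D^s_L(Q_G, \bar\beta) = \langle\nabla L(Q_G) - \nabla L(\bar\beta), Q_G - \bar\beta\rangle \leq -\langle u + \nabla L(\bar\beta), Q_G - \bar\beta\rangle \quad \forall u \in G. \tag{8}$$

Let $\tilde Q_G = \bar\beta + t(Q_G - \bar\beta)$, where we pick $t = 1$ if $\|Q_G - \bar\beta\| \leq r$ and $t \in (0, 1)$ with $\|\tilde Q_G - \bar\beta\| = r$ otherwise. Let $f(t) = D_L(\tilde Q_G, \bar\beta)$ so that $D^s_L(\tilde Q_G, \bar\beta) = t f'(t)$. The convexity of $L(\beta)$ implies $f'(t) \leq f'(1) = D^s_L(Q_G, \bar\beta)$. It follows that

$$D^s_L(\tilde Q_G, \bar\beta) + \langle u + \nabla L(\bar\beta), \tilde Q_G - \bar\beta\rangle \leq t\left\{ D^s_L(Q_G, \bar\beta) + \langle u + \nabla L(\bar\beta), Q_G - \bar\beta\rangle \right\} \leq 0,$$

which implies the restricted cone condition for $\tilde Q_G$ in the definition of RSC. Thus,

$$\gamma_L(\bar\beta; r, G, \|\cdot\|)\,\|\tilde Q_G - \bar\beta\|^2 - \|u + \nabla L(\bar\beta)\|_D\,\|\tilde Q_G - \bar\beta\| \leq 0.$$

Now by moving the term $\|u + \nabla L(\bar\beta)\|_D \|\tilde Q_G - \bar\beta\|$ to the right hand side and taking the infimum over $u$, we obtain $\gamma_L(\bar\beta; r, G, \|\cdot\|)\,\|\tilde Q_G - \bar\beta\| \leq \inf_{u \in G} \|u + \nabla L(\bar\beta)\|_D = \gamma_L(\bar\beta; r, G, \|\cdot\|)\,\Delta_r$. Since $\Delta_r < r$, we have $t = 1$ and $\tilde Q_G = Q_G$. It means that we always have $\|Q_G - \bar\beta\| \leq \Delta_r < r$. Consequently, (8) gives $D^s_L(Q_G, \bar\beta) \leq \inf_{u \in G} \|u + \nabla L(\bar\beta)\|_D\,\Delta_r$. This completes the proof. ■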



Remark 4.1 Although for simplicity the proof of Theorem 4.1 implicitly assumes that the solution of (7) is finite, this extra assumption is not necessary with a slightly more complex argument (which we exclude from the proof in order not to obscure the main idea). An easy way to see this is by adding a small (unrestricted) strongly convex term $L_\Delta(\beta)$ to $L$ and considering the dual certificate for the modified function $\tilde L(\beta) = L(\beta) + L_\Delta(\beta)$. Since the solution of (7) with $\tilde L(\beta)$ is finite, we can apply the proof to $\tilde L(\beta)$ and then simply let $L_\Delta(\beta) \to 0$.

Note that if $\nabla L(Q_G)$ is not unique, then the same value can be used both in Theorem 3.1 and in Theorem 4.1. Since $D_L(\bar\beta, Q_G) \leq D^s_L(\bar\beta, Q_G)$, this implies the following bound:

Corollary 4.1 Under the conditions of Theorem 4.1, we have

$$D_L(\bar\beta, \hat\beta) + [R(\hat\beta) - R_G(\hat\beta)] \leq \gamma_L(\bar\beta; r, G, \|\cdot\|)^{-1} \inf_{u \in G} \|u + \nabla L(\bar\beta)\|_D^2.$$

Similarly, we may apply Theorem 3.2 and Theorem 4.1 with $L(\beta)$ replaced by $\bar L_*(\beta)$ as in Definition 3.2. This implies the following general recovery bound.

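Corollary 4.2 Let $\|\cdot\|$ be a norm in $\bar\Omega$ and $\|\cdot\|_D$ its dual norm in $\bar\Omega^*$. Consider $\bar\beta \in \Omega$ and a closed convex $G \subset \partial R(\bar\beta)$. Consider $\bar L_*(\beta)$ as in Definition 3.2, and define

$$\gamma_{\bar L_*}(\bar\beta; r, G, \|\cdot\|) = \inf\left\{ \frac{D^s_{\bar L}(\beta, \bar\beta)}{\|\beta - \bar\beta\|^2} : 0 < \|\beta - \bar\beta\| \leq r;\ D^s_{\bar L}(\beta, \bar\beta) + \sup_{u \in G} \langle u + \nabla L(\bar\beta_*), \beta - \bar\beta\rangle \leq 0 \right\}$$

and $\Delta_r = (\gamma_{\bar L_*}(\bar\beta; r, G, \|\cdot\|))^{-1} \inf_{u \in G} \|u + \nabla L(\bar\beta_*)\|_D$. Assume for some $r > 0$, we have $\Delta_r < r$; and assume there exists $\bar r > D_L(\bar\beta, \bar\beta_*) + \gamma_{\bar L_*}(\bar\beta; r, G, \|\cdot\|)\,\Delta_r^2$ such that for all $\beta \in \Omega$: $D_L(\beta, \bar\beta_*) + [R(\beta) - R_G(\beta)] \leq \bar r$ implies $D_L(\bar\beta, \beta) \geq D_{\bar L}(\beta, \bar\beta)$. Then,

$$D_L(\hat\beta, \bar\beta_*) + [R(\hat\beta) - R_G(\hat\beta)] \leq D_L(\bar\beta, \bar\beta_*) + \gamma_{\bar L_*}(\bar\beta; r, G, \|\cdot\|)\,\Delta_r^2.$$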
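Proof Let $\bar L_*(\beta) = \bar L(\beta) - \langle\nabla\bar L(\bar\beta) - \nabla L(\bar\beta_*), \beta - \bar\beta\rangle$ and define

$$Q_G = \arg\min_{\beta} \left[ \bar L_*(\beta) + R_G(\beta) \right].$$

Then $Q_G$ is a generalized dual certificate in Definition 3.2. Note that $D_{\bar L_*}(\beta, \beta') = D_{\bar L}(\beta, \beta')$ and $\nabla\bar L_*(\bar\beta) = \nabla L(\bar\beta_*)$. The conditions of the corollary and Theorem 4.1, applied with $L$ replaced by $\bar L_*$, imply that $\|Q_G - \bar\beta\| < r$ and $D_{\bar L}(\bar\beta, Q_G) \leq \gamma_{\bar L_*}(\bar\beta; r, G, \|\cdot\|)\,\Delta_r^2$. Now we simply apply Theorem 3.2 to obtain that for all $t \in [0, 1]$ and $\tilde\beta = \bar\beta + t(\hat\beta - \bar\beta)$:

$$D_L(\tilde\beta, \bar\beta_*) + [R(\tilde\beta) - R_G(\tilde\beta)] \leq D_L(\bar\beta, \bar\beta_*) + \gamma_{\bar L_*}(\bar\beta; r, G, \|\cdot\|)\,\Delta_r^2 + [D_{\bar L}(\tilde\beta, \bar\beta) - D_L(\bar\beta, \tilde\beta)].$$

It is clear that when $t = 0$, we have $D_L(\tilde\beta, \bar\beta_*) + [R(\tilde\beta) - R_G(\tilde\beta)] < \bar r$. If the condition $D_L(\tilde\beta, \bar\beta_*) + [R(\tilde\beta) - R_G(\tilde\beta)] \leq \bar r$ holds for $t = 1$, then the desired bound is already proved due to the condition $D_L(\bar\beta, \tilde\beta) \geq D_{\bar L}(\tilde\beta, \bar\beta)$. Otherwise, there exists $t \in [0, 1]$ such that $D_L(\tilde\beta, \bar\beta_*) + [R(\tilde\beta) - R_G(\tilde\beta)] = \bar r$. However, this is impossible because the same argument gives

$$D_L(\tilde\beta, \bar\beta_*) + [R(\tilde\beta) - R_G(\tilde\beta)] \leq D_L(\bar\beta, \bar\beta_*) + \gamma_{\bar L_*}(\bar\beta; r, G, \|\cdot\|)\,\Delta_r^2 < \bar r.$$

This proves the desired bound. ■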

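Corollary 4.2 gives an oracle inequality with leading coefficient one for general loss functions, but the statement is rather complex. The situation for quadratic loss is much simpler, where we can take $\bar L(\beta) = L(\beta)$. This is because the condition $D_L(\bar\beta, \beta) \geq D_{\bar L}(\beta, \bar\beta)$ always holds. We also have a better constant because $D^s_L(\beta, \beta') = 2D_L(\beta, \beta') = 2D_L(\beta', \beta)$.

Corollary 4.3 Assume that $L(\beta)$ is a quadratic loss as in (6). Let $\|\cdot\|_D$ and $\|\cdot\|$ be dual norms, and consider $\bar\beta \in \Omega$ and a closed convex $G \subset \partial R(\bar\beta)$. We have

$$D_L(\hat\beta, \bar\beta_*) + [R(\hat\beta) - R_G(\hat\beta)] \leq D_L(\bar\beta, \bar\beta_*) + (2\gamma_{L_*}(\bar\beta; \infty, G, \|\cdot\|))^{-1} \inf_{u \in G} \|u + \nabla L(\bar\beta_*)\|_D^2,$$

where

$$\gamma_{L_*}(\bar\beta; \infty, G, \|\cdot\|) = \inf\left\{ \frac{2D_L(\beta, \bar\beta)}{\|\beta - \bar\beta\|^2} : 2D_L(\beta, \bar\beta) + \sup_{u \in G} \langle u + \nabla L(\bar\beta_*), \beta - \bar\beta\rangle \leq 0 \right\}.$$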



4.2 Quadratic Loss with Gaussian Random Design Matrix 

While in the general case, the estimation of $\gamma_{\bar L_*}(\bar\beta; r, G, \|\cdot\|)$ may be technically involved, for the special application of compressed sensing with Gaussian random design matrix and quadratic loss, we can obtain a relatively general and simple bound using Gordon's minimum restricted singular value estimation in [11]. This section describes the underlying idea.

In this section, we consider the quadratic loss function

$$L(\beta) = \|X\beta - Y\|_2^2, \tag{9}$$

where $\beta \in \mathbb{R}^p$, $Y \in \mathbb{R}^n$, and $X$ is an $n \times p$ matrix with iid Gaussian entries $N(0, 1)$. Here $\langle\cdot,\cdot\rangle$ is the Euclidean dot product in $\mathbb{R}^p$: $\langle u, v\rangle = u^\top v$ for $u, v \in \mathbb{R}^p$.
Definition 4.2 (Gaussian Width) Given any set $C \subset \mathbb{R}^p$, we define its Gaussian width as

$$\mathrm{width}(C) = E_\epsilon \sup_{z \in C;\ \|z\|_2 = 1} \epsilon^\top z,$$

where $\epsilon \sim N(0, I_{p \times p})$ and $E_\epsilon$ is the expectation with respect to $\epsilon$.
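For intuition, the Gaussian width can be estimated by Monte Carlo when the inner supremum has a closed form. The sketch below takes the assumed example where $C$ is the coordinate subspace spanned by the first $k$ axes of $\mathbb{R}^p$, so that $\sup_{z \in C, \|z\|_2 = 1} \epsilon^\top z$ is the Euclidean norm of the first $k$ entries of $\epsilon$. The function name, sample size, and seed are illustrative choices.

```python
import math
import random


def width_of_coordinate_subspace(k, p, draws=20000, seed=0):
    """Monte Carlo estimate of width(C) for C = span(e_1, ..., e_k) in R^p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        eps = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # For a subspace, the sup over unit vectors is the norm of the projection.
        total += math.sqrt(sum(e * e for e in eps[:k]))
    return total / draws
```

For this choice of $C$ the width depends only on $k$, not on the ambient dimension $p$: it is the expected norm of a $k$-dimensional standard Gaussian vector, which is slightly below $\sqrt{k}$.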

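The following estimation of Gaussian width is based on a similar computational technique used in [10].

Proposition 4.2 Let $C = \{\beta \in \mathbb{R}^p : \sup_{u \in G} \langle u + \nabla L(\bar\beta_*), \beta\rangle \leq 0\}$ and $\epsilon \sim N(0, I_{p \times p})$. Then,

$$\mathrm{width}(C) \leq E_\epsilon \inf_{u \in G;\ \gamma > 0} \|\gamma(u + \nabla L(\bar\beta_*)) - \epsilon\|_2.$$

Proof For all $\beta \in C$ with $\|\beta\|_2 = 1$, $\gamma > 0$, and $u \in G$, let $g = u + \nabla L(\bar\beta_*)$. We have $\langle g, \beta\rangle = \langle u + \nabla L(\bar\beta_*), \beta\rangle \leq 0$. Therefore, $\epsilon^\top\beta = (\epsilon - \gamma g)^\top\beta + \gamma g^\top\beta \leq (\epsilon - \gamma g)^\top\beta \leq \|\epsilon - \gamma g\|_2$. Since $u$ and $\gamma$ are arbitrary, we have

$$\epsilon^\top\beta \leq \inf_{u \in G;\ \gamma > 0} \|\gamma(u + \nabla L(\bar\beta_*)) - \epsilon\|_2.$$

Taking expectation with respect to $\epsilon$, we obtain the desired result. ■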

Gaussian width is useful when we apply Gordon's restricted singular value estimates, which give 
the following result. 

Theorem 4.2 Let $f_{\min}(X) = \min_{z \in C;\ \|z\|_2 = 1} \|Xz\|_2$ and $f_{\max}(X) = \max_{z \in C;\ \|z\|_2 = 1} \|Xz\|_2$. Let $\lambda_n = \sqrt{2}\,\Gamma((n+1)/2)/\Gamma(n/2)$, where $\Gamma(\cdot)$ is the $\Gamma$-function. We have for any $\delta > 0$:

$$P\left[f_{\min}(X) \leq \lambda_n - \mathrm{width}(C) - \delta\right] \leq P[N(0, 1) > \delta] \leq 0.5\exp(-\delta^2/2),$$

$$P\left[f_{\max}(X) \geq \lambda_n + \mathrm{width}(C) + \delta\right] \leq P[N(0, 1) > \delta] \leq 0.5\exp(-\delta^2/2).$$

Proof Since both $f_{\min}(X)$ and $f_{\max}(X)$ are Lipschitz-1 functions with respect to the Frobenius norm of $X$, we may apply the Gaussian concentration bound [3] [21] to obtain:

$$P\left[f_{\min}(X) \leq E[f_{\min}(X)] - \delta\right] \leq P[N(0, 1) > \delta],$$

$$P\left[f_{\max}(X) \geq E[f_{\max}(X)] + \delta\right] \leq P[N(0, 1) > \delta].$$

Now we may apply Corollary 1.2 of [11] to obtain the estimates

$$E[f_{\min}(X)] \geq \lambda_n - \mathrm{width}(C), \qquad E[f_{\max}(X)] \leq \lambda_n + \mathrm{width}(C),$$

which proves the theorem. ■
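The constant $\lambda_n$ in Theorem 4.2 is the expected Euclidean norm of a standard Gaussian vector in $\mathbb{R}^n$, and it satisfies $n/\sqrt{n+1} \leq \lambda_n \leq \sqrt{n}$. A minimal sketch for evaluating it numerically is given below; computing through `math.lgamma` is our choice to avoid overflow of the Gamma function for large $n$, and the function name is hypothetical.

```python
import math


def lambda_n(n):
    """sqrt(2) * Gamma((n+1)/2) / Gamma(n/2), evaluated in log space."""
    log_ratio = math.lgamma((n + 1) / 2.0) - math.lgamma(n / 2.0)
    return math.sqrt(2.0) * math.exp(log_ratio)


def lambda_n_bounds(n):
    """The elementary lower and upper bounds n/sqrt(n+1) and sqrt(n)."""
    return n / math.sqrt(n + 1.0), math.sqrt(n)
```

The two bounds differ by a relative factor of order $1/n$, so replacing $\lambda_n$ by either of them costs little in the sample-size statements that follow.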

Note that we have $n/\sqrt{n+1} \leq \lambda_n \leq \sqrt{n}$. Therefore we may replace $\lambda_n - \mathrm{width}(C)$ by $n/\sqrt{n+1} - \mathrm{width}(C)$ and $\lambda_n + \mathrm{width}(C)$ by $\sqrt{n} + \mathrm{width}(C)$. By combining Theorem 4.2 and Proposition 4.2 to estimate $\gamma_{L_*}(\cdot)$ in Corollary 4.3, we obtain the following result for Gaussian random projection in compressed sensing. The result improves the main ideas of [10].
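Theorem 4.3 Let $L(\beta)$ be given by (9) and $\epsilon \sim N(0, I_{p \times p})$. Suppose the conditions of Theorem 4.1 hold. Then, given any $g, \delta > 0$ such that $g + \delta \leq n/\sqrt{n+1}$, with probability at least

$$1 - \frac{1}{2}\exp\left(-\frac{1}{2}\left(n/\sqrt{n+1} - g - \delta\right)^2\right),$$

we have either

$$\|X(\hat\beta - \bar\beta_*)\|_2^2 + [R(\hat\beta) - R_G(\hat\beta)] \leq \|X(\bar\beta - \bar\beta_*)\|_2^2 + (4\delta^2)^{-1} \inf_{u \in G} \|u + \nabla L(\bar\beta_*)\|_2^2,$$

or

$$g \leq E_\epsilon \inf_{u \in G;\ \gamma > 0} \|\gamma(u + \nabla L(\bar\beta_*)) - \epsilon\|_2.$$

Proof Let $\|\cdot\| = \|\cdot\|_D = \|\cdot\|_2$ in Corollary 4.3. We simply note that $\gamma_{L_*}(\bar\beta; \infty, G, \|\cdot\|_2)$ is no smaller than $\inf\{2\|X\beta\|_2^2 : \|\beta\|_2 = 1, \beta \in C\}$, where $C = \{\beta \in \mathbb{R}^p : \sup_{u \in G} \langle u + \nabla L(\bar\beta_*), \beta\rangle \leq 0\}$. Let $E_1$ be the event $g > E_\epsilon \inf_{u \in G;\ \gamma > 0} \|\gamma(u + \nabla L(\bar\beta_*)) - \epsilon\|_2$. In the event $E_1$, Proposition 4.2 implies $g \geq \mathrm{width}(C)$, so that by Theorem 4.2,

$$P\left[\gamma_{L_*}(\bar\beta; \infty, G, \|\cdot\|_2) < 2\delta^2 \text{ and } E_1\right] \leq P\left[\inf\{\|X\beta\|_2 : \|\beta\|_2 = 1, \beta \in C\} < \delta \,\middle|\, E_1\right] \leq \frac{1}{2} e^{-(\lambda_n - g - \delta)^2/2}.$$

The desired result thus follows from Corollary 4.3. ■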



Remark 4.2 If $Y = X\bar\beta_* + \epsilon$ with iid Gaussian noise $\epsilon \sim N(0, \sigma^2 I_{n \times n})$, then the error bound in Theorem 4.3 depends on $\inf_{u \in G} \|u + \nabla L(\bar\beta_*)\|_2^2 = \inf_{u \in G} \|u + 2X^\top\epsilon\|_2^2 \approx 2n\sigma^2 \inf_{u \in G} \|\gamma u + \epsilon\|_2^2$ when $X^\top X/n$ is near orthogonal, where $\gamma = 0.5\sigma^{-2}/n$. In comparison, under the noise free case $\sigma = 0$ and $\nabla L(\bar\beta_*) = 0$, the number of samples required in Gaussian random design is upper bounded by




for appropriate $\gamma$. The similarity of the two terms suggests that the error bound in the oracle inequality and the number of samples required in Gaussian design are closely related.

4.3 Tangent Space Analysis 

In some applications, the restricted strong convexity condition may not hold globally. In this situation, one can further restrict the condition to a subspace $T$ of $\bar\Omega$, called the tangent space in the literature. We may regard the tangent space as a generalization of the support set concept for sparse regression. A more formal definition will be presented later in Section 5.2. In the current section, it can be motivated by considering the following decomposition of $G$:

$$G = \{u_0 + u_1 : u_0 \in G_0 \subset G,\ u_1 \in G_1\}, \tag{10}$$

where $G_1$ is a convex set that contains zero. Note that we can always take $G_0 = G$ and $G_1 = \{0\}$. However, this is not an interesting decomposition. The decomposition becomes useful when there exist $G_0$ and $G_1$ such that $G_0$ is small and $G_1$ is large. With this decomposition, we may define the tangent space as:

$$T = \{\beta \in \bar\Omega : \langle u_1, \beta\rangle = 0 \text{ for all } u_1 \in G_1\}.$$

For simple sparse regression with $\ell_1$ regularization, the tangent space can be considered as the subspace spanned by the nonzero coefficients of $\bar\beta$ (that is, the support of $\bar\beta$). Typically $\bar\beta \in T$ (although this requirement is not essential).
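For the $\ell_1$ case just mentioned, the tangent space can be made concrete. The sketch below assumes $\lambda = 1$, $G_0 = \{\mathrm{sgn}(\bar\beta)\}$ and $G_1 = \{u_1 : \mathrm{supp}(u_1) \subset S^c, \|u_1\|_\infty \le 1\}$, where $S$ is the support of $\bar\beta$; these choices and all function names are illustrative assumptions rather than notation from the paper.

```python
def support(beta_bar):
    """Indices of the nonzero coefficients of beta_bar."""
    return [j for j, b in enumerate(beta_bar) if b != 0]

def g1_generators(p, S):
    """Coordinate vectors e_j for j outside S: elements of G_1 whose linear span contains G_1."""
    in_S = set(S)
    return [[1.0 if k == j else 0.0 for k in range(p)] for j in range(p) if j not in in_S]

def inner(u, b):
    """Euclidean pairing <u, b>."""
    return sum(ui * bi for ui, bi in zip(u, b))

def in_tangent_space(beta, S):
    """beta lies in T iff <u_1, beta> = 0 for every u_1 in G_1 (checked on the generators)."""
    return all(inner(u, beta) == 0 for u in g1_generators(len(beta), S))

beta_bar = [1.5, 0.0, -2.0, 0.0]
S = support(beta_bar)
print(S, in_tangent_space(beta_bar, S))
```

In this setting membership in $T$ reduces to vanishing outside $S$, which is the sense in which the tangent space generalizes the support set.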

With the above defined $T$, we may construct a tangent space dual certificate $Q_G^T$ given any $u_0 \in G_0$ as:

$$Q_G^T = \bar\beta + \Delta Q, \qquad \Delta Q = \arg\min_{\Delta\beta\in T}\left[L(\bar\beta + \Delta\beta) + \langle u_0, \Delta\beta\rangle\right]. \tag{11}$$

Note that one may also define a generalized tangent space dual certificate simply by working with $L_*(\beta) = L(\beta) - \langle\nabla L(\bar\beta) - \nabla L(\beta_*), \beta - \bar\beta\rangle$ instead of $L(\beta)$.

The idea of tangent space analysis is to verify that the restricted dual certificate $Q_G^T$ is a dual certificate. Note that to bound $D_L^s(\bar\beta, Q_G^T)$, we only need to assume restricted strong convexity inside $T$, which is weaker than the globally defined restricted convexity in Section 4.1. The construction of $Q_G^T$ ensures that it satisfies the dual certificate definition in $T$ according to Definition 3.1, in that given any $\beta \in T$: $\langle\nabla L(Q_G^T) + u_0, \beta\rangle = 0$, which is the first order optimality condition of (11). However, we still have to check that the condition (3) holds for all $\beta \in \bar\Omega$ to ensure that $Q_G = Q_G^T$ is a (globally defined) dual certificate. The sufficient condition is presented in the following proposition.

Proposition 4.3 Consider $Q_G^T$ in (11). If $-\nabla L(Q_G^T) - u_0 \in G_1$, then $Q_G = Q_G^T$ is a dual certificate that satisfies condition (3).

Technically speaking, the tangent space dual certificate analysis is a generalization of the irrepresentable condition for $\ell_1$ support recovery [31]. However, we are interested in oracle inequality rather than support recovery, and in such context the analysis presented in this section generalizes
those of [HE]. 

Definition 4.3 (Restricted Strong Convexity in Tangent Space) Given a subspace $T$ that contains $\bar\beta$, we define the following quantity, which we refer to as the tangent space restricted strong convexity (TRSC) constant:

$$\gamma_L^T(\bar\beta; r, G, \|\cdot\|) = \inf\left\{\frac{D_L^s(\beta,\bar\beta)}{\|\beta - \bar\beta\|^2} : \|\beta - \bar\beta\| \le r;\ \beta - \bar\beta \in T;\ D_L^s(\beta,\bar\beta) + \langle u_0 + \nabla L(\bar\beta), \beta - \bar\beta\rangle \le 0\right\},$$

where $\|\cdot\|$ is a norm, $r > 0$ and $G \subset \partial R(\bar\beta)$.

Theorem 4.4 (Dual Certificate Error Bound in Tangent Space) Let $\|\cdot\|_D$ and $\|\cdot\|$ be dual norms, and consider convex $G \subset \partial R(\bar\beta)$ with the decomposition (10). If $\inf_{u\in G}\|u + \nabla L(\bar\beta)\|_D < r\cdot\gamma_L^T(\bar\beta; r, G, \|\cdot\|)$ for some $r > 0$, then

$$D_L^s(\bar\beta, Q_G^T) \le \left(\gamma_L^T(\bar\beta; r, G, \|\cdot\|)\right)^{-1}\|u_0 + P_T\nabla L(\bar\beta)\|_D^2,$$

where $Q_G^T$ is given by (11).

If the condition $\inf_{u\in G}\|u + \nabla L(\bar\beta)\|_D < r\cdot\gamma_L^T(\bar\beta; r, G, \|\cdot\|)$ holds for some $r > 0$, then Theorem 4.4 implies that (11) has a finite solution. However, the bound using Theorem 4.4 may not be the sharpest possible. For specific problems, better bounds may be obtained using more refined estimates (for example, in [12]). If $Q_G^T$ is a globally defined dual certificate in that (3) holds, then
we immediately obtain results analogous to Corollary 4.1 and Corollary 4.3.

Let $\beta^*$ be the target parameter in the sense that $\nabla L(\beta^*)$ is small. If we want to apply Theorem 3.2 in tangent space analysis, it may be convenient to consider the following choice of $\beta_*$ instead of setting $\beta_*$ to be the target $\beta^*$:

$$\beta_* = \beta^* + \Delta\beta_*, \qquad \Delta\beta_* = \arg\min_{\Delta\beta\in T} L(\beta^* + \Delta\beta). \tag{12}$$

The advantage of this choice is that $\beta_*$ is close to the target $\beta^*$, and thus $\nabla L(\beta_*)$ is small. Moreover, $\langle\nabla L(\beta_*), \beta\rangle = 0$ for all $\beta \in T$, which is convenient since it means $\langle\nabla L_*(\bar\beta), \beta\rangle = 0$ for all $\beta \in T$ with $L_*(\beta) = L(\beta) - \langle\nabla L(\bar\beta) - \nabla L(\beta_*), \beta - \bar\beta\rangle$.

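For the quadratic loss of (6), we have an analogy of Corollary 4.3. Since $\langle\cdot,\cdot\rangle$ becomes an inner product in a Hilbert space with $\bar\Omega = \bar\Omega^*$, we may further define the orthogonal projection to $T$ as $P_T$ and to its orthogonal complement $T^\perp$ as $P_T^\perp$. It is clear that in this case we also have $G_1 \subset T^\perp$.

Corollary 4.4 Assume that $L(\beta)$ is a quadratic loss as in (6). Consider convex $G \subset \partial R(\bar\beta)$ with the decomposition in (10). Consider $\beta_*$ such that $2H\beta_* - z = \tilde a + \tilde b$ with $\tilde a \in T$ and $\tilde b \in T^\perp$. Assume $H_T$, the restriction of $H$ to $T$, is invertible. If $u_0 \in T$, then let

$$\Delta Q = -0.5H_T^{-1}(u_0 + \tilde a) = \arg\min_{\Delta\beta\in T}\left[\langle H\Delta\beta, \Delta\beta\rangle + \langle u_0 + \tilde a, \Delta\beta\rangle\right].$$

If $P_T^\perp HH_T^{-1}(u_0 + \tilde a) - \tilde b \in G_1$, then

$$D_L(\beta_*, \hat\beta) + [R(\hat\beta) - R_G(\hat\beta)] \le D_L(\beta_*, \bar\beta) + 0.25\langle u_0 + \tilde a, H_T^{-1}(u_0 + \tilde a)\rangle.$$

Proof Let $Q_G = \bar\beta + \Delta Q$; then $Q_G$ is a generalized dual certificate that satisfies condition (5) with $\bar L = L$. This is because

$$-\nabla L_*(Q_G) - u_0 = -2H\Delta Q - \tilde a - \tilde b - u_0$$
$$= HH_T^{-1}(u_0 + \tilde a) - \tilde a - \tilde b - u_0$$
$$= P_T HH_T^{-1}(u_0 + \tilde a) + P_T^\perp HH_T^{-1}(u_0 + \tilde a) - \tilde a - \tilde b - u_0$$
$$= P_T^\perp HH_T^{-1}(u_0 + \tilde a) - \tilde b \in G_1,$$

where the second line uses $-2\Delta Q = H_T^{-1}(u_0 + \tilde a)$ and the last line uses $P_T HH_T^{-1}(u_0 + \tilde a) = u_0 + \tilde a$. We thus have

$$D_L(\beta_*, \hat\beta) + [R(\hat\beta) - R_G(\hat\beta)] \le D_L(\beta_*, \bar\beta) + D_L(\bar\beta, Q_G).$$

Since $D_L(\bar\beta, Q_G) = \langle H\Delta Q, \Delta Q\rangle = 0.25\langle u_0 + \tilde a, H_T^{-1}(u_0 + \tilde a)\rangle$, the desired bound follows. ■

If $\beta_*$ is given by (12), then $\tilde a = 0$, and Corollary 4.4 can be further simplified.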

5 Structured $\ell_1$ regularizer

This section introduces a generalization of $\ell_1$ regularization for which the calculations in the dual certificate analysis can be performed relatively easily. It should be noted that the general theory of dual certificate developed earlier can be applied to other regularizers that may not have the structured form presented here.

Recall that $\bar\Omega$ is a Banach space containing $\Omega$, $\bar\Omega^*$ is its dual, and $\langle u, \beta\rangle$ denotes $u(\beta)$ for linear functionals $u \in \bar\Omega^*$. Let $E_0$ be either a Euclidean (thus $\ell_1$) space of a fixed dimension or a countably infinite dimensional $\ell_1$ space. We write any $E_0$-valued quantity as $a = (a_1, a_2, \ldots)^\top$ and bounded linear functionals on $E_0$ as $w^\top a = \sum_j w_j a_j = \langle w, a\rangle$, with $w = (w_1, w_2, \ldots)^\top \in \ell_\infty$. Let $\mathscr{M}$ be the space of all bounded linear maps from $\bar\Omega$ to $E_0$.

Let $\mathscr{A}$ be a class of linear mappings in $\mathscr{M}$. We may define a regularizer as follows:

$$R(\beta) = \|\beta\|_{\mathscr{A}}, \qquad \|\beta\|_{\mathscr{A}} = \sup_{A\in\mathscr{A}}\|A\beta\|_1. \tag{13}$$

As a maximum of seminorms, the regularizer $\|\beta\|_{\mathscr{A}}$ is clearly a seminorm in $\{\beta : R(\beta) < \infty\}$. The choice of $\mathscr{A}$ is quite flexible. We allow $R(\cdot)$ to have a nontrivial kernel $\ker(R) = \cap_{A\in\mathscr{A}}\ker(A)$. Given the $\mathscr{A}$-norm $\|\cdot\|_{\mathscr{A}}$ on $\bar\Omega$, we may define its dual norm on $\bar\Omega^*$ as

$$\|u\|_{\mathscr{A},D} = \sup\{\langle u, \beta\rangle : \|\beta\|_{\mathscr{A}} \le 1\}.$$

Since $\|\beta\|_{\mathscr{A}}$ may take zero value even if $\beta \ne 0$, $\|u\|_{\mathscr{A},D}$ may take infinite value, which we will allow in the following discussions.

We call the class of regularizers defined in (13) structured-$\ell_1$ (or structured-Lasso) regularizers. This class of regularizers contains enough structure so that dual certificate analysis can be carried out in generality. In the following, we shall discuss various properties of the structured $\ell_1$ regularizer by generalizing the corresponding concepts of the $\ell_1$ regularizer for sparse regression. This regularizer obviously includes the vector $\ell_1$ penalty as a special case. In addition, we give two more structured regularization examples to illustrate the general applicability of this regularizer.

Example 5.1 Group $\ell_1$ penalty: Let $E_j$ be fixed Euclidean spaces, $X_j : \bar\Omega \to E_j$ be fixed linear maps, $\lambda_j$ be fixed positive numbers, and $\mathscr{A} = \{(v_1^\top X_1, v_2^\top X_2, \ldots)^\top : v_j \in E_j, \|v_j\|_2 \le \lambda_j\}$. Then,

$$R(\beta) = \sup_{A\in\mathscr{A}}\|A\beta\|_1 = \sum_j \lambda_j\|X_j\beta\|_2.$$

Example 5.2 Nuclear penalty: $\bar\Omega$ contains matrices of a fixed dimension. Let $s_j(\beta) \ge s_{j+1}(\beta)$ denote the singular values of matrix $\beta$ and $\mathscr{A} = \{A : A\beta = (w_j(U^\top\beta V)_{jj}, j \ge 1),\ U^\top U = I_r,\ V^\top V = I_r,\ r \ge 0,\ 0 \le w_j \le \lambda\}$. Then, the nuclear norm (or trace-norm) penalty for matrix $\beta$ is

$$R(\beta) = \sup_{A\in\mathscr{A}}\|A\beta\|_1 = \lambda\sum_j s_j(\beta).$$

5.1 Subdifferential

We characterize the subdifferential of $R(\beta)$ by studying the maximum property of $\mathscr{A}$. A set $\mathscr{A}$ is the largest class to generate (13) if for any $A_0 \in \mathscr{M}$, $\sup_{\beta\in\bar\Omega}\{\|A_0\beta\|_1 - R(\beta)\} = 0$ implies $A_0 \in \mathscr{A}$. We also need to introduce additional notation.

Definition 5.1 Given any map $M \in \mathscr{M}$, define its dual map $M^*$ from $\ell_\infty$ to $\bar\Omega^*$ as: $\forall w \in \ell_\infty$, $M^*w$ satisfies $\langle M^*w, \beta\rangle = w^\top(M\beta)$, $\forall\beta\in\bar\Omega$. Given any $w \in \ell_\infty$, define $w(\cdot)$ as a linear map from $\mathscr{M} \to \bar\Omega^*$ as $w(M) = M^*w$. We also denote by $\overline{w(\mathscr{A})}$ the closure of $w(\mathscr{A})$ in $\bar\Omega^*$.

The purpose of this definition is to introduce $e \in \ell_\infty$ so that $R(\beta)$ can be written as

$$R(\beta) = \sup_{A\in\mathscr{A}}\langle e(A), \beta\rangle = \sup_{u\in e(\mathscr{A})}\langle u, \beta\rangle.$$

In this regard, one only needs to specify $e(\mathscr{A})$, although for various problems it is more convenient to specify $\mathscr{A}$. Using this simpler representation, we have the following result, which characterizes the subdifferential of the structured $\ell_1$ regularizer.

Proposition 5.1 Let $E_1 = \{w = (w_1, w_2, \ldots)^\top \in \ell_\infty : |w_j| = 1\ \forall j\}$ and $e = (1, 1, \ldots)^\top \in E_1$.

(i) A set $\mathscr{A}$ is the largest class generating $R(\beta)$ iff the following conditions hold: (a) $\overline{w(\mathscr{A})} = \overline{e(\mathscr{A})}$ for all $w \in E_1$; (b) $\mathscr{A}$ is convex; (c) $\mathscr{A} = \cap_{w\in E_1} w^{-1}(\overline{e(\mathscr{A})})$, where $w^{-1}$ is the set inverse function.

(ii) Suppose $\mathscr{A}$ satisfies condition (a) in part (i). Then, $R(\beta) = \sup_{A\in\mathscr{A}}\langle e(A), \beta\rangle$.

(iii) Suppose $\mathscr{A}$ satisfies conditions (a) and (b) in part (i). Then, for $R(\beta) < \infty$,

$$\partial R(\beta) = \{u \in \overline{e(\mathscr{A})} : \langle u, \beta\rangle = R(\beta)\}.$$

In what follows, we assume $\mathscr{A}$ satisfies conditions (a) and (b) in (i). For notational simplicity, we also assume $\overline{e(\mathscr{A})} = e(\mathscr{A})$, which holds in the finite-dimensional case for closed $\mathscr{A}$. This gives

$$\partial R(\beta) = \{u = e(A) : A \in \mathscr{A},\ \langle u, \beta\rangle = R(\beta)\}. \tag{14}$$

Condition (c) in part (i) is then nonessential as it allows permutation of elements in $A$. Condition (c) holds for the specified $\mathscr{A}$ in Example 5.2 but not in Example 5.1.
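Proof We assume (a) since it is necessary for $\mathscr{A}$ to be maximal in part (i).

(ii) Under (a), $\sup_{A\in\mathscr{A}}\langle e(A), \beta\rangle = \sup_{w\in E_1, A\in\mathscr{A}}\langle w(A), \beta\rangle = \sup_{A\in\mathscr{A}, w\in E_1} w^\top(A\beta) = R(\beta)$.

(i) We assume (b) since it is necessary. It suffices to prove the equivalence between the following two conditions for each $A_0 \in \mathscr{M}$: $\sup_{\beta\in\bar\Omega}\{\|A_0\beta\|_1 - R(\beta)\} = 0$ and $A_0 \in \cap_{w\in E_1} w^{-1}(\overline{e(\mathscr{A})})$.

Let $A_0 \in \cap_{w\in E_1} w^{-1}(\overline{e(\mathscr{A})})$. For any $\beta \in \bar\Omega$, there exists $w_0 \in E_1$ such that $\|A_0\beta\|_1 = w_0^\top A_0\beta = \langle w_0(A_0), \beta\rangle$. Since $A_0 \in w_0^{-1}(\overline{e(\mathscr{A})})$, $w_0(A_0)$ is the weak limit of $e(A_k)$ for some $A_k \in \mathscr{A}$. It follows that $\|A_0\beta\|_1 = \langle w_0(A_0), \beta\rangle = \lim_k\langle e(A_k), \beta\rangle = \lim_k e^\top A_k\beta \le R(\beta)$. Now, consider $A_0 \notin w_0^{-1}(\overline{e(\mathscr{A})})$, so that $w_0(A_0) \notin \overline{e(\mathscr{A})}$. This implies the existence of $\beta \in \bar\Omega$ with $\|A_0\beta\|_1 \ge \langle w_0(A_0), \beta\rangle > \sup_{A\in\mathscr{A}}\langle e(A), \beta\rangle = R(\beta)$.

(iii) If $R(\beta) = \langle u, \beta\rangle$ with $u \in \overline{e(\mathscr{A})}$, then $R(b) - R(\beta) \ge \langle u, b\rangle - \langle u, \beta\rangle = \langle u, b - \beta\rangle$ for all $b$, so that $u \in \partial R(\beta)$. Now, suppose $v \in \partial R(\beta)$, so that $R(b) - R(\beta) \ge \langle v, b - \beta\rangle$ for all $b \in \bar\Omega$. Since $R(b)$ is a seminorm, taking $b = t\beta$ yields $R(\beta) = \langle v, \beta\rangle$. Moreover, $\langle v, b - \beta\rangle \le R(b - \beta)$ implies $v \in \overline{e(\mathscr{A})}$. The proof is complete. ■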



5.2 Structured Sparsity

An advantage of the structured $\ell_1$ regularizer, compared with a general seminorm, is that it allows the following notion of structured sparsity. A vector $\bar\beta$ is sparse in the structure $\mathscr{A}$ if

$$\exists W \in \mathscr{A} : R(\bar\beta) = \langle e(W), \bar\beta\rangle, \qquad S = \mathrm{supp}(W\bar\beta), \tag{15}$$

for a certain set $S$ of relatively small cardinality. This means a small structured $\ell_0$ "norm" $\|W\bar\beta\|_0$. In Example 5.2, this means $\bar\beta$ has low rank.

Let $e_S$ be the 0-1 valued vector with 1 on $S$ and 0 elsewhere. If $A \in \mathscr{A}$ can be written as $A = (W_S^\top, B_{S^c}^\top)^\top$, then $\|A\bar\beta\|_1 = \|W_S\bar\beta\|_1 + \|B_{S^c}\bar\beta\|_1 \le R(\bar\beta)$, which implies $\|B_{S^c}\bar\beta\|_1 = 0$ by (15). By (14), $e(A) = e((W_S^\top, B_{S^c}^\top)^\top) = e_S(W) + e_{S^c}(B) \in \partial R(\bar\beta)$. Thus, we may choose

$$G_{\mathscr{B}} = \{e_S(W) + e_{S^c}(B) : B_{S^c} \in \mathscr{B}\} \subset \partial R(\bar\beta) \tag{16}$$

for a certain class $\mathscr{B} \subset \{B_{S^c} : (W_S^\top, B_{S^c}^\top)^\top \in \mathscr{A}\}$.

Now let $G = G_{\mathscr{B}}$. Since members of $G$ can be written as $e_S(W) + e_{S^c}(B)$, $B \in \mathscr{B}$, this gives a decomposition of $G$ as in (10) with $G_0 = \{u_0\} = \{e_S(W)\}$ and $G_1 = e_{S^c}(\mathscr{B})$.

Since $B\bar\beta = 0$ for $B \in \mathscr{B}$, we have

$$R_G(\beta) = R(\bar\beta) + \sup_{u\in G}\langle u, \beta - \bar\beta\rangle = \langle e_S(W), \beta\rangle + \sup_{B\in\mathscr{B}}\langle e_{S^c}(B), \beta\rangle.$$

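Unless otherwise stated, we assume the following conditions on $\mathscr{B}$: (a) $w_{S^c}(\mathscr{B}) = e_{S^c}(\mathscr{B})$ for all $w \in E_1$; (b) $\mathscr{B}$ is convex; (c) $e_{S^c}(\mathscr{B})$ is closed in $\bar\Omega^*$. This is always possible since they match the assumed conditions on $\mathscr{A}$. Under these conditions, Proposition 5.1 gives

$$\sup_{B\in\mathscr{B}}\langle e_{S^c}(B), \beta\rangle = \sup_{B\in\mathscr{B}}\|B\beta\|_1 = \|\beta\|_{\mathscr{B}}.$$

Its dual norm can be defined on $\bar\Omega^*$ as

$$\|u\|_{\mathscr{B},D} = \sup\{\langle u, \beta\rangle : \|\beta\|_{\mathscr{B}} \le 1\}.$$

This leads to the following simplified expression:

$$R_G(\beta) = R(\bar\beta) + \sup_{u\in G}\langle u, \beta - \bar\beta\rangle = \langle e_S(W), \beta\rangle + \|\beta\|_{\mathscr{B}}. \tag{17}$$

Since $B\bar\beta = 0$ for all $B \in \mathscr{B}$, $\mathscr{B}$ may be used to represent a generalization of the zero coefficients of $\bar\beta$, while $W_S$ can be used to represent a generalization of the sign of $\bar\beta$. The larger the class $\mathscr{B}$ is, the more zero coefficients $\bar\beta$ has (thus $\bar\beta$ is sparser). One may always choose $\mathscr{B} = \emptyset$ when $\bar\beta$ is not sparse.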
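For the plain $\ell_1$ penalty $R(\beta) = \lambda\|\beta\|_1$, the expression for $R_G$ above becomes $R_G(\beta) = \lambda\langle\mathrm{sgn}(\bar\beta), \beta\rangle + \lambda\|\beta_{S^c}\|_1$, where $S$ is the support of $\bar\beta$. The Python sketch below evaluates both quantities under that assumption; the function names are ours and the example is illustrative only.

```python
def sign(x):
    """Sign of a real number: -1, 0 or 1."""
    return (x > 0) - (x < 0)

def l1_penalty(beta, lam):
    """R(beta) = lam * ||beta||_1."""
    return lam * sum(abs(b) for b in beta)

def r_g(beta, beta_bar, lam):
    """R_G(beta) = lam * <sgn(beta_bar), beta> on the support, plus lam * |beta_j| off it."""
    total = 0.0
    for b, bb in zip(beta, beta_bar):
        if bb != 0:
            total += lam * sign(bb) * b   # sign part, e_S(W)
        else:
            total += lam * abs(b)         # zero part, ||beta||_B
    return total

beta_bar = [2.0, -1.0, 0.0, 0.0]
beta = [1.5, 0.5, -0.3, 0.0]
lam = 0.5
print(l1_penalty(beta, lam), r_g(beta, beta_bar, lam))
```

The gap $R(\beta) - R_G(\beta)$ is nonnegative and vanishes at $\beta = \bar\beta$; here it comes from the second coordinate, whose sign disagrees with that of $\bar\beta$.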

5.3 Tangent Space

Given a convex function $\phi(\beta)$ and a point $\bar\beta \in \Omega$, $b \in \bar\Omega$ is a primal tangent vector if $\phi(\bar\beta + tb)$ is differentiable at $t = 0$. This means the equality of the left- and right-derivatives of $\phi(\bar\beta + tb)$ at $t = 0$. If $\phi(\beta)$ is a seminorm and $\bar\beta \ne 0$, then $\phi(\bar\beta + t\bar\beta) = (1 + t)\phi(\bar\beta)$ for all $|t| < 1$, so that $\bar\beta$ is always a primal tangent vector at $\bar\beta$. If $\langle u, b\rangle < \langle v, b\rangle$ for $\{u, v\} \subset \partial\phi(\bar\beta)$, then

$$\{\phi(\bar\beta - tb) - \phi(\bar\beta)\}/(-t) \le \langle u, b\rangle < \langle v, b\rangle \le \{\phi(\bar\beta + tb) - \phi(\bar\beta)\}/t, \qquad \forall t > 0,$$

so that $\phi(\bar\beta + tb)$ cannot be differentiable at $t = 0$. This motivates the following definition of the (primal) tangent space of a regularizer at a point $\bar\beta$ and its dual complement.

Definition 5.2 Given a convex regularizer $R(\beta)$, a point $\bar\beta \in \Omega$, and a class $G \subset \partial R(\bar\beta)$, we define the corresponding tangent space as

$$T = T_G = \{b \in \bar\Omega : \langle u - v, b\rangle = 0\ \forall u \in G, v \in G\} = \cap_{u,v\in G}\ker(u - v).$$

The dual complement of $T$, denoted by $T^\perp$, is defined as

$$T^\perp = T_G^\perp = \mathrm{closure}\{u : u \in \bar\Omega^*,\ \langle u, b\rangle = 0 \text{ for all } b \in T\}.$$

When $\langle\cdot,\cdot\rangle$ is an inner product, $\bar\Omega = \bar\Omega^*$ and $T^\perp$ is the orthogonal complement of $T$ in $\bar\Omega$.

Remark 5.1 Let $T$ be any closed subspace of $\bar\Omega$. A map $P_T : \bar\Omega \to \bar\Omega$ is a projection to $T$ if $P_T\beta = \beta$ is equivalent to $\beta \in T$. For such $P_T$, its dual $P_T^* : \bar\Omega^* \to \bar\Omega^*$, defined by $\langle P_T^*v, \beta\rangle = \langle v, P_T\beta\rangle$, is a projection from $\bar\Omega^* \to P_T^*\bar\Omega^*$. The image of $P_T^*$, $T^* = P_T^*\bar\Omega^*$, is a dual of $T$. Since $P_T$ and $P_T^*$ are projections, $v - P_T^*v \in T^\perp$ for all $v \in \bar\Omega^*$ and $\beta - P_T\beta \in (T^*)^\perp$ for all $\beta \in \bar\Omega$.

The above definition is general. For the structured $\ell_1$ penalty, we let $G$ be as in (16). We obtain by (14) that $\bar\beta \in T$. The default conditions on $\mathscr{B}$ imply $e_S(W) \in G$, so that

$$T = \{\beta : \langle e_{S^c}(B), \beta\rangle = 0\ \forall B \in \mathscr{B}\} = \cap_{B\in\mathscr{B}}\ker(B).$$

Since $G_1 = e_{S^c}(\mathscr{B})$, this is consistent with the definition of Section 4.3. The dual complement of $T$ is

$$T^\perp = \text{the closure of the linear span of } \{e_{S^c}(B) : B \in \mathscr{B}\}.$$

5.4 Interior Dual Certificate and Tangent Sparse Recovery Analysis

Consider a structured $\ell_1$ regularizer, a sparse $\bar\beta \in \Omega$, and a set $G_{\mathscr{B}} \subset \partial R(\bar\beta)$ as in (16). In the analysis of (1) with structured $\ell_1$ regularizer, members of the following subclass of $G_{\mathscr{B}}$ often appear.

Definition 5.3 (Interior Dual Certificate) Given $G_{\mathscr{B}}$ in (16), $v_0$ is an interior dual certificate if

$$v_0 \in G_{\mathscr{B}}, \qquad \langle v_0 - e_S(W), \beta\rangle \le \eta_\beta\|\beta\|_{\mathscr{B}} \text{ for some } 0 \le \eta_\beta < 1 \text{ for all } \beta.$$

Note that in the above definition, we refer to the dual variable $v_0$ as a "dual certificate" to be consistent with the literature. This should not be confused with the notion of primal dual certificate $Q_G$ defined earlier. A direct application of the interior dual certificate is the following extension of sparse recovery theory to general structured $\ell_1$ regularization. Suppose we observe a map $X : \bar\Omega \to V$ with a certain linear space $V$. Suppose there is no noise so that $X\beta^* = y$ and $\bar\beta = \beta^*$ is sparse. Then the $R(\beta)$ minimization method for the recovery of $\bar\beta$ is

$$\hat\beta = \arg\min\{R(\beta) : X\beta = y\}. \tag{18}$$

The following theorem provides sufficient conditions for the recovery of $\bar\beta$ by $\hat\beta$.

Theorem 5.1 Suppose $\bar\beta$ is sparse in the sense of (15). Let $G$ be as in (16) and $T$ be as in Definition 5.2. Let $V^*$ be the dual of $V$, $X^* : V^* \to \bar\Omega^*$ the dual of $X$, $P_T$ a projection to $T$, $P_T^*$ the dual of $P_T$ to $T^*$, and $V_T = XP_T\bar\Omega$. Suppose $(XP_T)^*$, the dual of $XP_T$, is a bijection from $V_T^*$ to $T^*$ and $e_S(W) \in T^*$. Define $v_0 = X^*((XP_T)^*)^{-1}e_S(W)$. If $v_0$ is an interior dual certificate, then $\hat\beta = \bar\beta$ is the unique solution of (18).

Moreover, $v_0$ is an interior dual certificate iff for all $\beta$, there exists $\eta_\beta < 1$ such that $\langle v_0 - P_T^*v_0, \beta\rangle \le \eta_\beta\sup_{B\in\mathscr{B}}\|B\beta\|_1$.

In matrix completion, this matches the dual certificate condition for recovery of low rank $\bar\beta$ by constrained minimization of the nuclear penalty [TJ [22].

Proof Suppose $v_0$ is an interior dual certificate of the form $v_0 = e_S(W) + e_{S^c}(B_0)$. Then, for all $\beta$ such that $X\beta = y = X\bar\beta$,

$$R(\beta) - R(\bar\beta) = R(\beta) - R(\bar\beta) - \langle((XP_T)^*)^{-1}e_S(W), X(\beta - \bar\beta)\rangle$$
$$= R(\beta) - R(\bar\beta) - \langle v_0, \beta - \bar\beta\rangle$$
$$\ge \sup_{u\in G}\langle u - v_0, \beta - \bar\beta\rangle$$
$$\ge (1 - \eta_\beta)\sup_{B\in\mathscr{B}}\|B\beta\|_1,$$

with $\eta_\beta < 1$. The first equation uses $X\beta = X\bar\beta$, and the second equation uses the definition $v_0 = X^*((XP_T)^*)^{-1}e_S(W)$.

Since (18) is constrained to $X\beta = y = X\bar\beta$, the above inequality means that $\bar\beta$ is a solution of (18). It remains to prove its uniqueness. Let $\beta$ be another solution of (18). Since $1 - \eta_\beta > 0$, if $R(\beta) = R(\bar\beta)$, then the above inequality implies that $\max_{B\in\mathscr{B}}\|B\beta\|_1 = 0$, so that $\beta \in T$. Since $\beta \in T$, $XP_T(\beta - \bar\beta) = X(\beta - \bar\beta) = 0$. This implies $\beta - \bar\beta = 0$, since the invertibility of $(XP_T)^*$ implies $T \cap \ker(XP_T) = \{0\}$. ■
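For the standard $\ell_1$ penalty with $\lambda = 1$ and a finite design matrix $X$, the certificate of Theorem 5.1 takes the familiar form $v_0 = X^\top X_S(X_S^\top X_S)^{-1}\mathrm{sgn}(\bar\beta_S)$, and it is an interior dual certificate when $\max_{j\notin S}|v_{0,j}| < 1$. The sketch below computes it for a small hand-made example; this specialization, the toy matrix, and the helper names are our assumptions, not part of the paper.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("singular system")
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def l1_dual_certificate(X, beta_bar):
    """Return (v0, eta) with v0 = X^T X_S (X_S^T X_S)^{-1} sgn(beta_bar_S), eta = max |v0_j| off S."""
    n, p = len(X), len(X[0])
    S = [j for j in range(p) if beta_bar[j] != 0]
    sgn = [1.0 if beta_bar[j] > 0 else -1.0 for j in S]
    gram = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in S] for a in S]
    c = solve(gram, sgn)                                              # (X_S^T X_S)^{-1} sgn
    h = [sum(X[i][j] * cj for j, cj in zip(S, c)) for i in range(n)]  # X_S c
    v0 = [sum(X[i][j] * h[i] for i in range(n)) for j in range(p)]    # X^T h
    eta = max((abs(v0[j]) for j in range(p) if j not in S), default=0.0)
    return v0, eta

X = [[1.0, 0.0, 0.5, 0.2],
     [0.0, 1.0, 0.5, -0.3],
     [0.0, 0.0, 0.1, 0.4]]
beta_bar = [1.0, -2.0, 0.0, 0.0]
v0, eta = l1_dual_certificate(X, beta_bar)
print(v0, eta)
```

On the support, $v_0$ reproduces $\mathrm{sgn}(\bar\beta_S)$ by construction; the value `eta` plays the role of $\eta_\beta$ in Definition 5.3, so `eta < 1` indicates that the recovery condition holds for this example.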



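When noise is present, we may employ the construction of Section 4.3. For the structured-$\ell_1$ regularizer, the analysis can be further simplified if we assume that there exists a target vector $\beta_*$ having the following property:

$$\nabla L(\beta_*) = \tilde a + \tilde b, \tag{19}$$

with a small $\tilde a$, and $\tilde b$ satisfies the condition

$$\tilde\eta = \|\tilde b\|_{\mathscr{B},D} < 1.$$

Recall that the dual norm $\|\cdot\|_{\mathscr{B},D}$ of $\|\cdot\|_{\mathscr{B}}$ is defined as $\|b\|_{\mathscr{B},D} = \sup\{\langle b, \beta\rangle : \|\beta\|_{\mathscr{B}} \le 1\}$. The condition means that there exists $\tilde B \in \mathscr{B}$ such that $\tilde b = \tilde w_{S^c}(\tilde B)$ with $\|\tilde w\|_\infty \le \tilde\eta$.

For such a target vector $\beta_*$, we will further consider an interior subset $G \subset G_{\mathscr{B}}$ in (16) with some $\eta \in [\tilde\eta, 1]$:

$$G = \{e_S(W) + \eta e_{S^c}(B) : B \in \mathscr{B}\}. \tag{20}$$

It follows that

$$R(\beta) - R_G(\beta) \ge R_{G_{\mathscr{B}}}(\beta) - R_G(\beta) \ge (1 - \eta)\|\beta\|_{\mathscr{B}}$$

and

$$\sup_{u\in G}\langle u + \nabla L(\beta_*), \beta - \bar\beta\rangle = \langle e_S(W) + \tilde a, \beta - \bar\beta\rangle + \sup_{B\in\mathscr{B}}\langle\eta e_{S^c}(B) + \tilde b, \beta - \bar\beta\rangle$$
$$\ge \langle e_S(W) + \tilde a, \beta - \bar\beta\rangle + (\eta - \tilde\eta)\|\beta - \bar\beta\|_{\mathscr{B}}.$$

This estimate can be directly used in the definition of RSC in Corollary 4.2. One way to construct such a target vector $\beta_*$ is using (12). In this case we may further assume that $\tilde a = 0$ because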
$\langle\nabla L(\beta_*), \beta\rangle = 0$ for any $\beta \in T$. In general, condition (19) is relatively easy to satisfy under the usual stochastic noise model with a small $\tilde a$ since $\nabla L(\beta_*)$ is small. In the special setting of Theorem 5.1, we have $\nabla L(\beta_*) = 0$ with $\beta_* = \bar\beta = \beta^*$.
For simplicity, in the following we will consider quadratic loss of the form (6) and apply Corollary 4.4. Consider $G$ in (20), $\beta_*$ in (19) with $\tilde a \in T$ ($\tilde b \in T^\perp$), and $Q_G^T$ defined as in (11) but with $L(\beta)$ replaced by $L_*(\beta) = L(\beta) - \langle\nabla L(\bar\beta) - \nabla L(\beta_*), \beta - \bar\beta\rangle$, which can be equivalently written as

$$Q_G^T = \bar\beta + \Delta Q, \qquad \Delta Q = -0.5H_T^{-1}(e_S(W) + \tilde a).$$

This is consistent with the construction of Theorem 5.1 in the sense that in the noise-free case, we can let $H = X^\top X$ and $v_0 = -2H\Delta Q = HH_T^{-1}e_S(W)$ with $\tilde a = 0$.

We assume that the following condition holds:

$$\|P_T^\perp HH_T^{-1}(e_S(W) + \tilde a) - \tilde b\|_{\mathscr{B},D} \le \eta, \tag{21}$$

which is consistent with the noise free interior dual certificate existence condition in Theorem 5.1 by setting $\eta_\beta = \eta$. The condition is a direct generalization of the strong irrepresentable condition for $\ell_1$ regularization in [31] to structured $\ell_1$ regularization. Under this condition, $Q_G^T$ is a dual certificate that satisfies the generalized condition (5) in Definition 3.2 with $\bar L = L$ and $\delta = 0$. Corollary 4.4 implies that

$$D_L(\beta_*, \hat\beta) + (1 - \eta)\|\hat\beta\|_{\mathscr{B}} \le D_L(\beta_*, \bar\beta) + 0.25\langle e_S(W) + \tilde a, H_T^{-1}(e_S(W) + \tilde a)\rangle.$$

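5.5 Recovery Analysis with Global Restricted Strong Convexity

We can also employ the dual certificate construction of Section 4.1 with $G$ in (20) and $\beta_* = \bar\beta$. Corollary 4.1 implies the following result:

$$D_L(\bar\beta, \hat\beta) + (1 - \eta)\|\hat\beta\|_{\mathscr{B}} \le \left(\gamma_L(\bar\beta; r, G, \|\cdot\|)\right)^{-1}\|e_S(W) + \tilde a\|_D^2,$$

where $\gamma_L(\bar\beta; r, G, \|\cdot\|)$ is lower bounded by

$$\inf\left\{\frac{D_L^s(\beta,\bar\beta)}{\|\beta - \bar\beta\|^2} : \|\beta - \bar\beta\| \le r;\ D_L^s(\beta,\bar\beta) + (\eta - \tilde\eta)\|\beta - \bar\beta\|_{\mathscr{B}} + \langle e_S(W) + \tilde a, \beta - \bar\beta\rangle \le 0\right\}.$$

We may also consider a more general $\beta_*$ instead of assuming $\beta_* = \bar\beta$. For example, consider the definition of $\beta_*$ in (12), which implies that $\tilde a = 0$, or simply let $\beta_* = \beta^*$. We can apply Corollary 4.3 to the quadratic loss function of (6). It implies

$$D_L(\beta_*, \hat\beta) + (1 - \eta)\sup_{B\in\mathscr{B}}\|B\hat\beta\|_1 \le D_L(\beta_*, \bar\beta) + \left(2\gamma_L(\bar\beta;\infty,G,\|\cdot\|)\right)^{-1}\|\tilde a + e_S(W)\|_D^2, \tag{22}$$

where $\gamma_L(\bar\beta;\infty,G,\|\cdot\|)$ is lower bounded by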






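5.6 Recovery Analysis with Gaussian Random Design

We can also apply the results of Section 4.2 by considering quadratic loss with Gaussian random design matrix in (9). We can use the following proposition.

Proposition 5.2 If $\tilde\eta < \eta$ and $\epsilon \sim N(0, I_{p\times p})$, then

$$\left(E_\epsilon\inf_{u\in G;\,\gamma>0}\|\gamma(u + \nabla L(\beta_*)) - \epsilon\|_2\right)^2 \le \inf_{\gamma>0}E_\epsilon\inf_{B\in\mathscr{B}}\|\gamma(e_S(W) + \tilde a + (\eta - \tilde\eta)e_{S^c}(B)) - \epsilon\|_2^2.$$

Therefore we may apply Theorem 4.3, which implies that given any $g, \delta \ge 0$ such that $g + \delta \le n/\sqrt{n+1}$, with probability at least

$$1 - \tfrac{1}{2}\exp\left(-\tfrac{1}{2}\left(n/\sqrt{n+1} - g - \delta\right)^2\right),$$

we have either $\tilde\eta \ge \eta$, or

$$\|X(\hat\beta - \beta_*)\|_2^2 + (1 - \eta)\|\hat\beta\|_{\mathscr{B}} \le \|X(\bar\beta - \beta_*)\|_2^2 + (4\delta^2)^{-1}\|e_S(W) + \tilde a\|_2^2,$$

or

$$g^2 \le \inf_{\gamma>0}E_{\epsilon\sim N(0,I_{p\times p})}\inf_{B\in\mathscr{B}}\|\gamma(e_S(W) + \tilde a + (\eta - \tilde\eta)e_{S^c}(B)) - \epsilon\|_2^2.$$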

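5.7 Parameter Estimation Bound

Generally speaking, the technique of dual certificate allows us to directly obtain an oracle inequality

$$D_L(\hat\beta, \beta_*) + (1 - \eta)\|\hat\beta\|_{\mathscr{B}} \le \delta \tag{23}$$

for some $\delta > 0$. If $\delta$ is small (in such case, $\bar\beta$ should be close to $\beta_*$), then we may also be interested in the parameter estimation bound $\|\hat\beta - \beta_*\|$. In such case, additional estimates will be needed on top of the dual certificate theory of this paper. This section demonstrates how to obtain such a bound from (23).

Although parameter estimation bounds can be obtained for general loss functions $L(\cdot)$, they involve relatively complex notation. In order to illustrate the main ideas while avoiding unnecessary complexity, in the following we will only consider the quadratic loss case, where $\langle\cdot,\cdot\rangle$ is an inner product.

Proposition 5.3 Assume that $L(\cdot)$ is the quadratic loss function given by (6). Consider any subspace $\tilde T$ that contains the tangent space $T$. Let $\delta' = \delta/(1 - \eta) + \|P_{\tilde T}^\perp\beta_*\|_{\mathscr{B}}$ with $\delta$ given by (23). Define the correlation between $\tilde T$ and $\tilde T^\perp$ as:

$$\mathrm{cor}(\tilde T, \tilde T^\perp) = \sup\left\{|\langle HP_{\tilde T}\beta, P_{\tilde T}^\perp\beta\rangle| / \langle HP_{\tilde T}\beta, P_{\tilde T}\beta\rangle^{1/2} : \beta_* + \beta \in \Omega,\ \|\beta\|_{\mathscr{B}} \le \delta'\right\}.$$

Let $\Delta = \hat\beta - \beta_*$. Then, $\|\Delta\|_{\mathscr{B}} \le \delta'$, and

$$\langle H_{\tilde T}P_{\tilde T}\Delta, P_{\tilde T}\Delta\rangle^{1/2} \le \sqrt{(1 - \eta)\delta'} + 2\,\mathrm{cor}(\tilde T, \tilde T^\perp).$$

Proof We have

$$\langle H_{\tilde T}P_{\tilde T}\Delta, P_{\tilde T}\Delta\rangle + 2\langle HP_{\tilde T}\Delta, P_{\tilde T}^\perp\Delta\rangle + (1 - \eta)\|\Delta\|_{\mathscr{B}}$$
$$\le \langle H_{\tilde T}P_{\tilde T}\Delta, P_{\tilde T}\Delta\rangle + 2\langle HP_{\tilde T}\Delta, P_{\tilde T}^\perp\Delta\rangle + \langle HP_{\tilde T}^\perp\Delta, P_{\tilde T}^\perp\Delta\rangle + (1 - \eta)\|\hat\beta\|_{\mathscr{B}} + (1 - \eta)\|\beta_*\|_{\mathscr{B}}$$
$$= D_L(\hat\beta, \beta_*) + (1 - \eta)\|\hat\beta\|_{\mathscr{B}} + (1 - \eta)\|P_{\tilde T}^\perp\beta_*\|_{\mathscr{B}} \le (1 - \eta)\delta',$$

where we have used the fact that $\|\beta_*\|_{\mathscr{B}} = \|P_{\tilde T}^\perp\beta_*\|_{\mathscr{B}}$. This means that if we let $\beta = \Delta$, then we have $\|\beta\|_{\mathscr{B}} \le \delta'$, and $\beta_* + \beta \in \Omega$. Let $x^2 = \langle H_{\tilde T}P_{\tilde T}\Delta, P_{\tilde T}\Delta\rangle$; we have $\langle HP_{\tilde T}\Delta, P_{\tilde T}^\perp\Delta\rangle = x\langle HP_{\tilde T}\beta, P_{\tilde T}^\perp\beta\rangle/\langle HP_{\tilde T}\beta, P_{\tilde T}\beta\rangle^{1/2}$. It follows that

$$x^2 - 2x\,\mathrm{cor}(\tilde T, \tilde T^\perp) \le (1 - \eta)\delta'.$$

Solving for $x$ leads to the desired bound. ■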
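The final step of the proof, solving the quadratic inequality for $x$, can be spelled out as follows. This is our own sketch of the elementary algebra, writing $c \ge 0$ for the correlation term and $d \ge 0$ for the right-hand side; these two symbols are shorthand introduced here and are not the paper's notation.

```latex
% Shorthand (ours): c = cor(\tilde T, \tilde T^\perp) \ge 0, \quad d = (1-\eta)\delta' \ge 0.
\begin{align*}
x^2 - 2cx \le d
  &\;\Longrightarrow\; (x - c)^2 \le c^2 + d \\
  &\;\Longrightarrow\; x \le c + \sqrt{c^2 + d} \\
  &\;\Longrightarrow\; x \le 2c + \sqrt{d},
\end{align*}
% where the last step uses \sqrt{c^2 + d} \le c + \sqrt{d} for c, d \ge 0.
```

The resulting bound is of the form $\sqrt{d} + 2c$, that is, a square-root term in $(1-\eta)\delta'$ plus twice the correlation.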

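Clearly, we can have a cruder estimate:

$$\mathrm{cor}(\tilde T, \tilde T^\perp) \le \sup\left\{\langle HP_{\tilde T}^\perp\beta, P_{\tilde T}^\perp\beta\rangle^{1/2} : \beta_* + \beta \in \Omega,\ \|\beta\|_{\mathscr{B}} \le \delta'\right\}.$$

The bound in Proposition 5.3 is useful when $H$ is invertible on $\tilde T$:

$$\langle H\beta, \beta\rangle \ge \gamma_{\tilde T}\langle\beta, \beta\rangle \qquad \forall\beta \in \tilde T,$$

which leads to a bound on $\|P_{\tilde T}\Delta\|_2$. Although one may simply choose $\tilde T = T$, the resulting bound may be suboptimal, as we shall see later on. Therefore it can be beneficial to choose a larger $\tilde T$. Examples of this result will be presented in Section 6.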

6 Examples 

We will present a few examples to illustrate the analysis as well as concrete substantiations of the 
relatively abstract notations we have used so far. 

6.1 Group $\ell_1$ Least Squares Regression

We assume that $\bar\Omega = \mathbb{R}^p$, and consider the model

$$Y = X\beta^* + \epsilon$$

with the least squares loss function (9). This corresponds to the quadratic loss (6) with $H = X^\top X$ and $z = 2X^\top Y$. The inner product is Euclidean: $\langle u, b\rangle = u^\top b$.

Now, we assume that $p = qm$, and the variables $\{1, \ldots, p\}$ are divided into $q$ non-overlapping blocks $\Gamma_1, \ldots, \Gamma_q \subset \{1, \ldots, p\}$ of size $m$ each. One method to take advantage of the group structure is to use the group Lasso method [28] with

$$R(\beta) = \lambda\|\beta\|_{\Gamma,1}, \qquad \|\beta\|_{\Gamma,1} = \sum_{j=1}^q\|\beta_{\Gamma_j}\|_2. \tag{24}$$

Its dual norm is

$$\|\beta\|_{\Gamma,\infty} = \max_j\|\beta_{\Gamma_j}\|_2.$$

Group $\ell_1$ regularization includes the standard $\ell_1$ regularization as a special case, where we choose $m = 1$, $q = p$, and $\Gamma_j = \{j\}$.
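The group norm in (24) and its dual can be computed directly from the block structure. The following Python sketch is an illustration under our own naming; `groups` is a list of index lists standing for the non-overlapping blocks of variables.

```python
import math

def group_norms(beta, groups):
    """Euclidean norm of beta restricted to each block of indices."""
    return [math.sqrt(sum(beta[i] ** 2 for i in g)) for g in groups]

def group_l1(beta, groups):
    """Group l1 norm: the sum of block-wise Euclidean norms."""
    return sum(group_norms(beta, groups))

def group_linf(u, groups):
    """Dual norm: the largest block-wise Euclidean norm."""
    return max(group_norms(u, groups))

def inner(u, b):
    """Euclidean inner product u^T b."""
    return sum(ui * bi for ui, bi in zip(u, b))

groups = [[0, 1], [2, 3], [4, 5]]            # p = 6, q = 3, m = 2
beta = [3.0, 4.0, 0.0, 0.0, 1.0, 0.0]
u = [1.0, 0.0, 0.0, 2.0, 0.6, 0.8]
singletons = [[j] for j in range(len(beta))]  # m = 1 recovers the plain l1 norm
print(group_l1(beta, groups), group_linf(u, groups), group_l1(beta, singletons))
```

Duality shows up as the inequality $\langle u, \beta\rangle \le \|u\|_{\Gamma,\infty}\|\beta\|_{\Gamma,1}$, which follows by applying the Cauchy-Schwarz inequality within each block.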

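The group-$\ell_1$ regularizer is a special case of (13), where we have

$$\mathscr{A} = \{A = (a_j) : A\beta = (a_j^\top\beta)_{j=1,\ldots,q} : a_j \in \mathbb{R}^p,\ \|a_j\|_2 \le \lambda,\ \mathrm{supp}(a_j) \subset \Gamma_j\}.$$

For a group sparse $\bar\beta$, its group support is the smallest $S \subset \{1, \ldots, q\}$ such that $\mathrm{supp}(\bar\beta) \subset \cup_{k\in S}\Gamma_k$; below, subscripts $S$ and $S^c$ on a vector in $\mathbb{R}^p$ refer to the variables in these groups. We may define $\mathrm{sgn}_\Gamma(\bar\beta_{\Gamma_j})$ to be $\mathrm{sgn}_\Gamma(\bar\beta_{\Gamma_j}) = \bar\beta_{\Gamma_j}/\|\bar\beta_{\Gamma_j}\|_2$ when $j \in S$, and $\mathrm{sgn}_\Gamma(\bar\beta_{\Gamma_j}) = 0$ when $j \notin S$.

Using the notation of Section 5, we may take $W = (\lambda\,\mathrm{sgn}_\Gamma(\bar\beta_{\Gamma_j}))_{j=1,\ldots,q}$, and $\mathscr{B} = \{B = (b_j) \in \mathscr{A} : b_j = 0 \text{ for all } j \in S\}$ in (16). In fact, our computation does not directly depend on $W$ and $\mathscr{B}$. Instead, we may simply specify

$$e_S(W) = \lambda\,\mathrm{sgn}_\Gamma(\bar\beta) \quad\text{and}\quad e_{S^c}(\mathscr{B}) = \{b \in \mathbb{R}^p : \|b\|_{\Gamma,\infty} \le \lambda;\ \mathrm{supp}(b) \subset S^c\},$$

and

$$\|\beta\|_{\mathscr{B}} = \lambda\|\beta_{S^c}\|_{\Gamma,1}, \qquad \|b\|_{\mathscr{B},D} = \|b_{S^c}\|_{\Gamma,\infty}/\lambda.$$

This means that we may take $G$ in (20) as $G = \{u : u_S = \lambda\,\mathrm{sgn}_\Gamma(\bar\beta) \text{ and } \|u_{S^c}\|_{\Gamma,\infty} \le \eta\lambda\}$ for some $0 \le \eta \le 1$, which implies that $R(\beta) - R_G(\beta) \ge (1 - \eta)\lambda\|\beta_{S^c}\|_{\Gamma,1}$. The tangent space is $T = \{u : \mathrm{supp}(u) \subset S\}$.

We further consider a target $\beta_*$ that satisfies (19), which we can rewrite as

$$2X^\top(X(\beta_* - \beta^*) - \epsilon) = \tilde a + \tilde b,$$

where $\mathrm{supp}(\tilde b) \subset S^c$, and $\|\tilde b\|_{\Gamma,\infty} = \tilde\eta\lambda$. We assume that $\|\tilde a\|_2$ is small. Note that we may choose $\lambda$ sufficiently large so that $\tilde\eta$ can be arbitrarily close to 0. In particular, we may choose $\lambda \ge \|\tilde b\|_{\Gamma,\infty}/\eta$ so that $\tilde\eta \le \eta < 1$. We are especially interested in the case of $\tilde a = 0$, which can be achieved with the construction in (12).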



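Global Restricted Eigenvalue Analysis

Assume that $\lambda \ge \|\tilde b\|_{\Gamma,\infty}/\eta$ and let $\tilde\eta = \|\tilde b\|_{\Gamma,\infty}/\lambda$. We have $\tilde\eta \le \eta$. Therefore in order to apply (22), we may define the restricted eigenvalue as

$$\gamma = \inf\left\{2\|X\Delta\beta\|_2^2/\|\Delta\beta\|^2 : 2\|X\Delta\beta\|_2^2 + \Delta\beta^\top(\lambda\,\mathrm{sgn}_\Gamma(\bar\beta) + \tilde a) + (\eta - \tilde\eta)\lambda\|\Delta\beta_{S^c}\|_{\Gamma,1} \le 0\right\}.$$

We then obtain from (22)

$$\|X(\hat\beta - \beta_*)\|_2^2 + (1 - \eta)\lambda\|\hat\beta_{S^c}\|_{\Gamma,1} \le \|X(\bar\beta - \beta_*)\|_2^2 + (2\gamma)^{-1}\|\tilde a + \lambda\,\mathrm{sgn}_\Gamma(\bar\beta)\|_D^2.$$

If we choose $\tilde a = 0$, and let $\|\cdot\| = \|\cdot\|_{\Gamma,1}$ with $\|\cdot\|_D = \|\cdot\|_{\Gamma,\infty}$, then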
\\X0 - AOUl + (1 - ^AH^IIr.i < \\X(P - + 
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with 

$$\gamma = \inf\left\{\|X\Delta\beta\|_2^2/(\|\Delta\beta\|_{\Gamma,1}^2/|S|) : \Delta\beta_S^\top\mathrm{sgn}_\Gamma(\bar\beta) + (\eta - \tilde\eta)\|\Delta\beta_{S^c}\|_{\Gamma,1} \le 0\right\}.$$

The result is meaningful as long as $\gamma > 0$. Even for the standard $\ell_1$ regularizer, this condition is weaker than previous restricted eigenvalue conditions in the literature. In particular it is weaker than the compatibility condition of [25] (which is the weakest condition in the earlier literature), which requires

$$\inf\left\{\|X\Delta\beta\|_2^2/(\|\Delta\beta\|_{\Gamma,1}^2/|S|) : (1 - \eta)\|\Delta\beta_{S^c}\|_{\Gamma,1} \le (1 + \eta)\|\Delta\beta_S\|_{\Gamma,1}\right\} > 0.$$

Our result replaces $\|\Delta\beta_S\|_{\Gamma,1}$ by $-\Delta\beta_S^\top\mathrm{sgn}_\Gamma(\bar\beta)$, which is a useful improvement because the former can be significantly larger than the latter. For $\ell_1$ analysis, the use of $\mathrm{sgn}(\bar\beta)$ has appeared in various studies such as |26[ [TOl El [6]. In fact, the calculation for Gaussian random design, which we shall perform next, depends on $\mathrm{sgn}(\bar\beta)$ and $\mathrm{sgn}_\Gamma(\bar\beta)$.
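The gap between $\|\Delta\beta_S\|_1$ and $-\Delta\beta_S^\top\mathrm{sgn}(\bar\beta)$ is easy to see numerically in the $\ell_1$ case ($m = 1$). The sketch below compares the two quantities for two perturbation directions; the vectors and function names are ours and purely illustrative.

```python
def sign(x):
    """Sign of a real number: -1, 0 or 1."""
    return (x > 0) - (x < 0)

def signed_term(delta_S, beta_bar_S):
    """-delta_S^T sgn(beta_bar_S), the quantity that replaces ||delta_S||_1."""
    return -sum(d * sign(b) for d, b in zip(delta_S, beta_bar_S))

def l1_norm(v):
    """Plain l1 norm."""
    return sum(abs(x) for x in v)

beta_bar_S = [2.0, -1.0, 0.5]
aligned = [0.3, -0.2, 0.1]    # moves every coefficient away from zero
opposed = [-0.3, 0.2, -0.1]   # shrinks every coefficient toward zero

for name, delta in (("aligned", aligned), ("opposed", opposed)):
    print(name, signed_term(delta, beta_bar_S), l1_norm(delta))
```

The two quantities agree only when the perturbation shrinks every coefficient toward zero; for a perturbation aligned with $\mathrm{sgn}(\bar\beta)$ the signed term is negative while the $\ell_1$ norm stays positive, so a constraint written with the signed term excludes directions that a constraint written with the $\ell_1$ norm would admit.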



Gaussian Random Design 

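Assume that $X$ is a Gaussian random design matrix in (9); then we can apply the analysis in Section 5.6. We will first consider the standard $\ell_1$ regularizer with $m = 1$, which requires the following estimate.

Proposition 6.1 Consider standard $\ell_1$ regularization with single element groups. If $\tilde\eta < \eta$ and $p \ge 2|S|$, we have

$$\inf_{\gamma>0}E_{\epsilon\sim N(0,I_{p\times p})}\inf_{\|b\|_\infty\le 1}\|\gamma(\lambda\,\mathrm{sgn}(\bar\beta) + \tilde a) + \gamma(\eta - \tilde\eta)\lambda b_{S^c} - \epsilon\|_2^2 \le 2|S| + \frac{2\ln(p/|S| - 1)}{(\eta - \tilde\eta)^2}\|\mathrm{sgn}(\bar\beta) + \tilde a/\lambda\|_2^2.$$

Proof Given $\gamma > 0$, let $t = \gamma(\eta - \tilde\eta)\lambda$. We have

$$E_{\epsilon\sim N(0,I_{p\times p})}\inf_{\|b\|_\infty\le 1}\|\gamma(\lambda\,\mathrm{sgn}(\bar\beta) + \tilde a) + \gamma(\eta - \tilde\eta)\lambda b_{S^c} - \epsilon\|_2^2 \le a_0 + a_1,$$

where

$$a_0 = E_{\epsilon\sim N(0,I_{p\times p})}\|\gamma(\lambda\,\mathrm{sgn}(\bar\beta) + \tilde a) - \epsilon_S\|_2^2 = |S| + \gamma^2\|\lambda\,\mathrm{sgn}(\bar\beta) + \tilde a\|_2^2,$$

and

$$a_1 = E_{\epsilon\sim N(0,I_{p\times p})}\inf_{\|b\|_\infty\le 1}\|tb_{S^c} - \epsilon_{S^c}\|_2^2 = (p - |S|)E_{\epsilon\sim N(0,1)}(|\epsilon| - t)_+^2$$
$$= (p - |S|)\int_{x=0}^\infty\frac{2}{\sqrt{2\pi}}x^2\exp(-(x + t)^2/2)\,dx$$
$$\le (p - |S|)\int_{x=0}^\infty\frac{2}{\sqrt{2\pi}}x^2\exp(-(x^2 + t^2)/2)\,dx = (p - |S|)e^{-t^2/2}.$$

By setting $t = \sqrt{2\ln(p/|S| - 1)}$ and $\gamma = \sqrt{2\ln(p/|S| - 1)}/((\eta - \tilde\eta)\lambda)$, we have $a_1 \le |S|$. This gives the bound. ■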
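The two facts used for $a_1$ in the proof can be checked numerically: the bound $E(|\epsilon| - t)_+^2 \le e^{-t^2/2}$ for $\epsilon \sim N(0,1)$, and the fact that the stated choice of $t$ makes $(p - |S|)e^{-t^2/2}$ equal to $|S|$. The closed form $E(|\epsilon| - t)_+^2 = 2[(1 + t^2)P(\epsilon > t) - t\varphi(t)]$ used below, with $\varphi$ the standard normal density, is a standard Gaussian integral that we supply ourselves; the function names are also ours.

```python
import math

def soft_threshold_second_moment(t: float) -> float:
    """E (|eps| - t)_+^2 for eps ~ N(0,1), via 2 * [(1 + t^2) * Q(t) - t * phi(t)]."""
    q = 0.5 * math.erfc(t / math.sqrt(2.0))                    # upper tail Q(t)
    phi = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)    # density at t
    return 2.0 * ((1.0 + t * t) * q - t * phi)

def a1_upper_bound(p: int, s: int, t: float) -> float:
    """(p - |S|) * exp(-t^2 / 2), the bound on a_1 from the proof."""
    return (p - s) * math.exp(-t * t / 2.0)

p, s = 1000, 10                                  # requires p >= 2|S|
t = math.sqrt(2.0 * math.log(p / s - 1.0))       # the choice of t made in the proof
print(t, a1_upper_bound(p, s, t), (p - s) * soft_threshold_second_moment(t))
```

With this $t$ the exponential bound equals $|S|$ exactly, while the exact expectation is smaller, which indicates some slack in the constant.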
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For the standard i\ regularization (m = 1), we obtain the following bound if p > 2\S\: given 
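any $\eta \in (0, 1]$, $g, \delta \ge 0$ such that $g + \delta \le n/\sqrt{n+1}$, with probability at least

$$1 - \tfrac{1}{2}\exp\left(-\tfrac{1}{2}\left(n/\sqrt{n+1} - g - \delta\right)^2\right),$$

we have either $\lambda < \|\tilde b\|_\infty/\eta$, or

$$\|X(\hat\beta - \beta_*)\|_2^2 + (1 - \eta)\lambda\|\hat\beta_{S^c}\|_1 \le \|X(\bar\beta - \beta_*)\|_2^2 + (4\delta^2)^{-1}\|\lambda\,\mathrm{sgn}(\bar\beta) + \tilde a\|_2^2,$$

or

$$g^2 \le 2|S| + \frac{2\ln(p/|S| - 1)}{(\eta - \|\tilde b\|_\infty/\lambda)^2}\|\mathrm{sgn}(\bar\beta) + \tilde a/\lambda\|_2^2.$$

Note that in the noise-free case of $\tilde a = \tilde b = 0$, this shows that exact recovery can be achieved with large probability when $n > 2|S|(1 + \ln(p/|S| - 1))$, and this sample complexity result is rather sharp. More generally for $m > 1$, we have a similar bound with worse constants as follows.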

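Proposition 6.2 If $\tilde\eta < \eta$ and $p \ge 2m|S|$, we have

$$\inf_{\gamma>0}E_{\epsilon\sim N(0,I_{p\times p})}\inf_{\|b\|_{\Gamma,\infty}\le 1}\|\gamma(\lambda\,\mathrm{sgn}_\Gamma(\bar\beta) + \tilde a) + \gamma(\eta - \tilde\eta)\lambda b_{S^c} - \epsilon\|_2^2 \le |S|(m + 1) + \frac{\left(\sqrt{2\ln(q/|S| - 1)} + \sqrt{m}\right)^2}{(\eta - \tilde\eta)^2}\|\mathrm{sgn}_\Gamma(\bar\beta) + \tilde a/\lambda\|_2^2.$$

Proof Given $\gamma > 0$, let $t = \gamma(\eta - \tilde\eta)\lambda$. Let $\chi$ be a $\chi$-distributed random variable of degree $m$, with $\lambda_m$ being its expectation as defined in Theorem 4.2. Since $\chi$ is the singular value of a $1\times m$ Gaussian matrix, similar to Theorem 4.2, we can apply the Gaussian concentration bound [21] to obtain for all $\delta > 0$:

$$P[\chi \ge \lambda_m + \delta] \le 0.5\exp(-\delta^2/2).$$

Now we assume $t \ge \lambda_m$, and

$$E_{\epsilon\sim N(0,I_{p\times p})}\inf_{\|b\|_{\Gamma,\infty}\le 1}\|\gamma(\lambda\,\mathrm{sgn}_\Gamma(\bar\beta) + \tilde a) + \gamma(\eta - \tilde\eta)\lambda b_{S^c} - \epsilon\|_2^2 \le a_0 + a_1,$$

where

$$a_0 = E_{\epsilon\sim N(0,I_{p\times p})}\|\gamma(\lambda\,\mathrm{sgn}_\Gamma(\bar\beta) + \tilde a) - \epsilon_S\|_2^2 = m|S| + \gamma^2\|\lambda\,\mathrm{sgn}_\Gamma(\bar\beta) + \tilde a\|_2^2,$$

and

$$a_1 = E_{\epsilon\sim N(0,I_{p\times p})}\inf_{\|b\|_{\Gamma,\infty}\le 1}\|tb_{S^c} - \epsilon_{S^c}\|_2^2 = (q - |S|)E_{\epsilon\sim N(0,I_{m\times m})}(\|\epsilon\|_2 - t)_+^2$$
$$= -(q - |S|)\int_{x=0}^\infty x^2\,dP(\chi \ge x + t)$$
$$= 2(q - |S|)\int_{x=0}^\infty xP(\chi \ge x + t)\,dx$$
$$\le 2(q - |S|)\int_{x=0}^\infty 0.5x\exp(-(x + t - \lambda_m)^2/2)\,dx$$
$$\le (q - |S|)\exp(-(t - \lambda_m)^2/2)\int_{x=0}^\infty x\exp(-x^2/2)\,dx$$
$$= (q - |S|)\exp(-(t - \lambda_m)^2/2).$$

By setting $t = \lambda_m + \sqrt{2\ln(q/|S| - 1)}$ and $\gamma = t/((\eta - \tilde\eta)\lambda)$, we have $a_1 \le |S|$. This gives the desired bound using the estimate $\lambda_m \le \sqrt{m}$. ■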

We obtain the following bound for group-Lasso with m > 1 when q > 2\S\: given any rj £ (0, 1], 
g, 5 > such that g + 5 < n/\Jn + 1, with probability at least 

i_I exp ^_I( n /Vn + i- 5 -5) 2 ^ , 

we have either A < ||&||r,oo/f?> or 

\\X0 ~ flOlll + (1 " »7)A||/3sc||r,i < - + {^T^mrifi) + ~4l 



or 



9 , < + + (V2M,/|S| -l) + ^ | |^ r( g ) + a/A|| , 

(?? - PI|r,oo/A) 2 

Note that in the noise-free case of $\tilde a = \tilde b = 0$, this shows that exact recovery can be achieved with large probability when $n > |S|(m+1) + |S|\big(\sqrt{2\ln(q/|S|-1)}+\sqrt{m}\big)^2 = O(|S|(m+\ln(q/|S|)))$.

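If we consider the scenario that the noise $\epsilon \sim N(0,\sigma^2 I_{n\times n})$ is Gaussian, then we may set $\lambda$ to be at the order $\sigma\sqrt{n(m+\ln(q/|S|))}$, and with large probability, we have $\lambda > \|\tilde b\|_{\Gamma,\infty}/\tilde\eta$ with a nonzero $\tilde a$ such that $\|\tilde a\|_2^2 = O(|S|\lambda^2)$. This gives the following error bound with $\delta$ chosen at order $\sqrt{n}$:

$$\|X(\hat\beta-\beta_*)\|_2^2 + (1-\eta)\lambda\|\hat\beta_{S^c}\|_{\Gamma,1} \le \|X(\bar\beta-\beta_*)\|_2^2 + O(|S|\lambda^2/n).$$

With the optimal choice of $\lambda$, we have

$$\|X(\hat\beta-\beta_*)\|_2^2 + (1-\eta)\lambda\|\hat\beta_{S^c}\|_{\Gamma,1} \le \|X(\bar\beta-\beta_*)\|_2^2 + O(\sigma^2|S|(m+\ln(q/|S|))).$$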
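
Tangent Space Analysis

In this analysis, we assume that $\mathrm{supp}(\tilde a) \subset S$. We can then define

$$Q_G = \bar\beta + \Delta Q, \qquad \Delta Q_S = -0.5\,(X_S^\top X_S)^{-1}(\lambda\,\mathrm{sgn}_\Gamma(\bar\beta_S)+\tilde a_S) \quad\text{and}\quad \Delta Q_{S^c} = 0.$$

We know that $Q_G$ is a dual certificate if

$$\|X_{S^c}^\top X_S (X_S^\top X_S)^{-1}\mathrm{sgn}_\Gamma(\bar\beta_S)\|_{\Gamma,\infty} \le \eta - \|X_{S^c}^\top X_S (X_S^\top X_S)^{-1}\tilde a_S - \tilde b\|_{\Gamma,\infty}/\lambda.$$

This is essentially the irrepresentable condition of [1], which reduces to the $\ell_1$ irrepresentable condition of [31] when $m = 1$. This condition implies the following oracle inequality:

$$\|X(\beta_*-\hat\beta)\|_2^2 + (1-\eta)\lambda\|\hat\beta_{S^c}\|_{\Gamma,1} \le \|X(\beta_*-\bar\beta)\|_2^2 + 0.25\lambda^2\|(X_S^\top X_S)^{-1/2}(\mathrm{sgn}_\Gamma(\bar\beta_S)+\tilde a_S/\lambda)\|_2^2.$$

This oracle inequality generalizes a simpler result for $m = 1$ in [5].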
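
To make the irrepresentable condition concrete, the sketch below computes its left-hand side for a given design in the special case $\tilde a = \tilde b = 0$, where the condition requires this value to be at most $\eta$. It assumes that $\mathrm{sgn}_\Gamma(\beta)$ restricted to a group equals that group's sub-vector divided by its 2-norm (zero for zero groups), and that groups are passed as index arrays; both conventions and all function names are assumptions made for illustration.

```python
import numpy as np

def group_sign(beta, groups):
    # Assumed definition: on each group, beta_group / ||beta_group||_2 (0 if the group is zero).
    out = np.zeros_like(beta, dtype=float)
    for g in groups:
        nrm = np.linalg.norm(beta[g])
        if nrm > 0:
            out[g] = beta[g] / nrm
    return out

def irrepresentable_value(X, beta_bar, groups):
    # max over inactive groups j of ||X_j^T X_S (X_S^T X_S)^{-1} sgn_Gamma(beta_S)||_2
    active = [g for g in groups if np.any(beta_bar[g] != 0)]
    inactive = [g for g in groups if not np.any(beta_bar[g] != 0)]
    S = np.concatenate(active)
    XS = X[:, S]
    w = np.linalg.solve(XS.T @ XS, group_sign(beta_bar, groups)[S])
    v = XS @ w
    return max(np.linalg.norm(X[:, g].T @ v) for g in inactive)
```

With orthonormal columns the value is zero, so the condition holds for any $\eta > 0$; if an inactive column duplicates an active one (with $m = 1$), the value equals one and the condition fails for every $\eta < 1$.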
Simple Parameter Estimation Bounds

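Next, we consider the parameter estimation bound using Proposition 5.3. First, we consider the case of choosing $\bar T = T$; let $\gamma_S$ be the smallest eigenvalue of $X_S^\top X_S$. If we assume $\beta_* \approx \bar\beta$ and $\tilde a$ is small, we can expect a bound of the form:

$$\|X(\beta_*-\hat\beta)\|_2^2 + (1-\eta)\lambda\|\hat\beta_{S^c}\|_{\Gamma,1} \le \delta = O(\lambda^2|S|/\gamma_S),$$

where $\lambda = O(\sigma\sqrt{n(m+\ln q)})$. Now, if we let $X_{\Gamma_j}$ be the $j$-th group-column (with indices $\Gamma_j$) of $X$, then

$$\mathrm{cor}(T,T^\perp) \le \sup\{\|(X_S^\top X_S)^{-1/2}X_S^\top X\beta_{S^c}\|_2 : \lambda\|\beta_{S^c}\|_{\Gamma,1} \le \delta'\} \le \gamma_S^{-1/2}\max_{j\in S^c}\|X_S^\top X_{\Gamma_j}\|_{sp}\,\delta'/\lambda,$$

where

$$\gamma_S = \inf\{\|X_S\beta_S\|_2^2 : \|\beta_S\|_2 = 1\}$$

is the smallest eigenvalue of $X_S^\top X_S$. Proposition 5.3 gives $\|(\hat\beta-\beta_*)_{S^c}\|_{\Gamma,1} \le \delta'/\lambda$ and

$$\|(\hat\beta-\beta_*)_S\|_2 \le \sqrt{\delta\gamma_S^{-1}} + 2(\delta'/\lambda)\gamma_S^{-1}\max_{j\in S^c}\|X_S^\top X_{\Gamma_j}\|_{sp},$$

where $\delta' = \delta/(1-\eta) + \lambda\|(\beta_*)_{S^c}\|_{\Gamma,1}$, and here we use $\|\cdot\|_{sp}$ to denote the spectral norm of a matrix.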

For the sake of illustration, we will next assume the standard error bound of $\delta' = O(\lambda^2|S|/\gamma_S)$; the above result then leads to the following bounds:

$$\|(\hat\beta-\beta_*)_{S^c}\|_{\Gamma,1} = O(\lambda|S|/\gamma_S),$$

$$\|(\hat\beta-\beta_*)_S\|_2 \le (\lambda\sqrt{|S|}/\gamma_S)\cdot O\Big(1+\sqrt{|S|}\,\gamma_S^{-1}\max_{j\in S^c}\|X_S^\top X_{\Gamma_j}\|_{sp}\Big).$$

If $X$ is very weakly correlated, $X_S^\top X_{\Gamma_j}$ will be small. In the ideal case $\gamma_S^{-1}\max_{j\in S^c}\|X_S^\top X_{\Gamma_j}\|_{sp} = O(1/\sqrt{|S|})$, we have

$$\|\hat\beta-\beta_*\|_{\Gamma,1} = O(\lambda|S|/\gamma_S),$$

which is of the optimal order. However, in the pessimistic case of $\gamma_S^{-1}\max_{j\in S^c}\|X_S^\top X_{\Gamma_j}\|_{sp} = O(1)$, we obtain

$$\|\hat\beta-\beta_*\|_{\Gamma,1} = O(\lambda|S|^{3/2}/\gamma_S),$$

which has an extra factor of $\sqrt{|S|}$. Using the above derivation, the 2-norm error bound is always of the order

$$\|\hat\beta-\beta_*\|_2 \le \|(\hat\beta-\beta_*)_S\|_2 + \|(\hat\beta-\beta_*)_{S^c}\|_{\Gamma,1} = O(\lambda|S|/\gamma_S),$$

which has an extra factor of $\sqrt{|S|}$ compared to the ideal bound of $\|\hat\beta-\beta_*\|_2 = O(\lambda\sqrt{|S|})$ in the earlier literature such as [13], [17] under appropriately defined global restricted eigenvalue assumptions.

It should be mentioned that the assumptions we have made so far are relatively weak, since no global restricted eigenvalue assumption is made, and thus the resulting bound $\|\hat\beta-\beta_*\|_2 = O(\lambda|S|)$ might be the best possible under these assumptions. In order to obtain the ideal bound of $\|\hat\beta-\beta_*\|_2 = O(\lambda\sqrt{|S|})$ (as appeared in the earlier literature), we will consider adding extra assumptions.

Refined Parameter Estimation Bounds



The first extra assumption we will make is that sparse eigenvalues are bounded from above, which prevents the pessimistic case where $X_{\Gamma_j}$ are highly correlated for $j \in S^c$. Such correlation can be defined with the upper sparse eigenvalue as:

$$\rho_+(k) = \sup\{\|X\beta\|_2^2/\|\beta\|_2^2 : |\mathrm{supp}_\Gamma(\beta)| \le k\},$$

where $\mathrm{supp}_\Gamma(\beta) \subset \{1,\ldots,q\}$ is the (smallest) index set for groups of $\{\Gamma_j\}$ that cover $\mathrm{supp}(\beta)$. Using this notation, if we choose the constraint set $\Omega$ and $\beta_*$ such that $\beta+\beta_* \in \Omega$ implies that $\|\beta\|_{\Gamma,\infty} \le M$ for some $M \le \delta'/\lambda$, then it can be shown using the standard shifting argument for group $\ell_1$ regularization (e.g., [13]) that for all positive integers $k \le \delta'/(\lambda M)$:

$$\mathrm{cor}(T,T^\perp) \le \sup\{\|X\beta_{S^c}\|_2 : \|\beta\|_{\Gamma,\infty} \le M,\ \lambda\|\beta\|_{\Gamma,1} \le \delta'\} \le 2\rho_+(k-1)^{1/2}\delta'/(\lambda\sqrt{k}).$$

This implies that

$$\|(\hat\beta-\beta_*)_S\|_2 \le \sqrt{\delta\gamma_S^{-1}} + 4\gamma_S^{-1/2}\delta'\rho_+(k-1)^{1/2}/(\lambda\sqrt{k}),$$

and

$$\|(\hat\beta-\beta_*)_{S^c}\|_2 \le \sqrt{\delta' M/\lambda} \le \delta'/(\lambda\sqrt{k}).$$

Therefore, assuming the standard error bound of $\delta' = O(\lambda^2|S|/\gamma_S)$, we obtain

$$\|\hat\beta-\beta_*\|_2 = O(\lambda\sqrt{|S|}/\gamma_S)\,\inf_k\Big[1+\sqrt{|S|/k}\,\sqrt{\rho_+(k)/\gamma_S}\Big].$$

If $M$ is sufficiently small, then we can take $k$ sufficiently large so that $|S| = O(k)$, and it is possible to obtain an error bound of $\|\hat\beta-\beta_*\|_2 = O(\lambda\sqrt{|S|})$.
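
For small problems the upper sparse eigenvalue $\rho_+(k)$ can be evaluated by brute force, which may help build intuition for how it grows with $k$. The sketch below enumerates all unions of exactly $k$ groups; by eigenvalue interlacing this also covers unions of fewer groups. The function name and the representation of groups as index arrays are assumptions for illustration, and the enumeration is exponential in $q$, so it is only meant for toy sizes.

```python
import itertools
import numpy as np

def upper_sparse_eigenvalue(X, groups, k):
    # rho_+(k): largest eigenvalue of X_A^T X_A over unions A of k groups.
    # Sub-unions cannot have a larger top eigenvalue (Cauchy interlacing),
    # so enumerating unions of exactly min(k, q) groups is enough.
    k = min(k, len(groups))
    best = 0.0
    for combo in itertools.combinations(range(len(groups)), k):
        idx = np.concatenate([groups[j] for j in combo])
        XA = X[:, idx]
        best = max(best, float(np.linalg.eigvalsh(XA.T @ XA)[-1]))
    return best
```

The returned value is nondecreasing in `k`, and at `k = len(groups)` it equals the largest eigenvalue of `X.T @ X`.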

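If we do not impose the $\|\cdot\|_{\Gamma,\infty}$ norm constraint on $\Omega$, then another method is to choose $\bar T$ larger than $T$, which is the approach employed in [6] for the standard $\ell_1$ regularization. Here we consider a similar assumption for group-Lasso, where we define for all integers $k \ge 1$:

$$\gamma_{S,k} = \inf\{\|X\beta\|_2^2 : |\mathrm{supp}_\Gamma(\beta)\setminus S| < k,\ \|\beta\|_2 = 1\}.$$

It is clear that $\gamma_S = \gamma_{S,1}$. Given any $k$ such that $\gamma_{S,k}$ is not too small, we may define

$$\bar T = \{\beta : \mathrm{supp}_\Gamma(\beta) \subset \bar S\}$$

and

$$\bar S = S \cup \{\text{group indices of the largest } k-1 \text{ values of } \|(\hat\beta-\beta_*)_{\Gamma_j}\|_2 : j \notin S\}.$$

The smallest eigenvalue of $H_{\bar T}$ is no smaller than $\gamma_{S,k}$, and we also have $\|(\hat\beta-\beta_*)_{\bar S^c}\|_{\Gamma,\infty} \le M = \|(\hat\beta-\beta_*)_{S^c}\|_{\Gamma,1}/k \le \delta'/(k\lambda)$. Using the same derivation as before, we have

$$\|\hat\beta-\beta_*\|_2 = O(\lambda\sqrt{|S|}/\gamma_{S,k})\Big[1+\sqrt{|S|/k}\,\sqrt{\rho_+(k)/\gamma_{S,k}}\Big].$$

This means that if we can choose $k$ at the order of $|S|$ such that $\rho_+(k)/\gamma_{S,k} = O(1)$, then we have

$$\|\hat\beta-\beta_*\|_2 = O(\lambda\sqrt{|S|}).$$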






In the standard $\ell_1$ case, the requirement of $\rho_+(k)/\gamma_{S,k} = O(1)$ is also needed in the so-called "RIP-less" approach of [6] to obtain the ideal bound for $\|\hat\beta-\beta_*\|_2$. The approach is called "RIP-less" because this condition is weaker than the classical RIP condition of [8] (or its group-Lasso counterpart in [13]), which is far more restrictive. This bound is also flexible, as we can choose any $k \ge 1$: in the worst case of $k = 1$, we have $\|\hat\beta-\beta_*\|_2 = O(\lambda|S|)$ with an extra $\sqrt{|S|}$ factor. This extra factor can be removed as long as we take $k$ at the order of $|S|$.
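
The quantity $\gamma_{S,k}$ can likewise be computed by enumeration on toy problems. The sketch below takes the minimum, over all ways of adding $k-1$ groups outside $S$, of the smallest eigenvalue of the corresponding Gram sub-matrix; adding groups can only lower that eigenvalue, so it is enough to add exactly $k-1$. The function name and the encoding of $S$ as a list of group indices are illustrative assumptions.

```python
import itertools
import numpy as np

def lower_sparse_eigenvalue(X, groups, S, k):
    # gamma_{S,k}: min over A = S plus (k-1) extra groups of lambda_min(X_A^T X_A).
    others = [j for j in range(len(groups)) if j not in S]
    extra = min(k - 1, len(others))
    best = np.inf
    for combo in itertools.combinations(others, extra):
        idx = np.concatenate([groups[j] for j in list(S) + list(combo)])
        XA = X[:, idx]
        best = min(best, float(np.linalg.eigvalsh(XA.T @ XA)[0]))
    return best
```

With `k = 1` this reduces to the smallest eigenvalue of $X_S^\top X_S$, matching $\gamma_S = \gamma_{S,1}$, and the value is nonincreasing in `k`.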



6.2 Matrix completion 

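Let $\Omega$ be the set of $p\times q$ matrices, and assume that the inner product is defined as $\langle\beta,\beta'\rangle = \mathrm{tr}(\beta^\top\beta')$. We consider $x_1,\ldots,x_n \in \Omega$ and observe

$$y_i = \langle x_i,\beta_*\rangle + \epsilon_i,$$

where $\{\epsilon_i\}$ are noises. In order to recover $\beta_*$, we consider the following convex optimization problem:

$$\hat\beta = \arg\min_{\beta\in\Omega}\left[\sum_{i=1}^n(\langle x_i,\beta\rangle - y_i)^2 + \lambda\|\beta\|_*\right],$$

where $\|\beta\|_*$ is the trace-norm of matrix $\beta$, defined as the sum of its singular values.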
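
As a small illustration of the objective above, the following sketch evaluates the trace norm through the singular values and then the full regularized objective for a list of measurement matrices. The function names and argument layout are assumptions chosen for the example.

```python
import numpy as np

def trace_norm(beta):
    # Sum of singular values of beta.
    return float(np.linalg.svd(beta, compute_uv=False).sum())

def objective(beta, xs, y, lam):
    # sum_i (<x_i, beta> - y_i)^2 + lam * ||beta||_*, with <x, beta> = tr(x^T beta).
    fitted = np.array([np.sum(x * beta) for x in xs])
    return float(np.sum((fitted - y) ** 2) + lam * trace_norm(beta))
```

For a diagonal matrix the trace norm is the sum of the absolute diagonal entries, which gives a quick way to check the helper by hand.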

In the following, we will briefly discuss results that can be obtained from our analysis using the 
tangent space analysis. For simplicity, we will keep the discussion at a relatively high level, with 
some detailed discussions skipped. 

We assume that $\bar\beta$ is of rank $r$, and $\bar\beta = U\Sigma V^\top$ is the SVD of $\bar\beta$, where $U$ and $V$ are $p\times r$ and $q\times r$ matrices. The tangent space is defined as $T = \{\beta : P_T(\beta) = \beta\}$, where $P_T(\beta) = UU^\top\beta + \beta VV^\top - UU^\top\beta VV^\top$.
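
The projection $P_T$ is easy to implement directly from its formula. The sketch below does so and extracts $U$ and $V$ from a truncated SVD; the helper names are illustrative assumptions.

```python
import numpy as np

def rank_r_factors(beta_bar, r):
    # Leading r left/right singular vectors and singular values of beta_bar.
    U, s, Vt = np.linalg.svd(beta_bar, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r].T

def tangent_projection(beta, U, V):
    # P_T(beta) = U U^T beta + beta V V^T - U U^T beta V V^T
    PU = U @ U.T
    PV = V @ V.T
    return PU @ beta + beta @ PV - PU @ beta @ PV
```

Since $P_T$ is an orthogonal projection under the trace inner product, applying it twice gives the same result, it leaves $\bar\beta$ unchanged, and the residual $\beta - P_T(\beta)$ is orthogonal to $P_T(\beta)$.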

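Using notations in Section 5, we may take $e_S(W) = \lambda UV^\top$ and $e_{S^c}(\mathcal{B}) = \{b \in T^\perp : \|b\|_{sp} \le \lambda\}$ in (16). Therefore

$$\|\beta\|_{\mathcal{B}} = \lambda\|P_T^\perp\beta\|_*, \qquad \|b\|_{\mathcal{B},D} = \|P_T^\perp b\|_{sp}/\lambda.$$

This means that we may take $G$ in (20) as $G = \{u : P_T u = \lambda UV^\top \ \&\ \|P_T^\perp u\|_{sp} \le \eta\lambda\}$ for some $0 \le \eta \le 1$, which implies that $R(\beta) - R_G(\beta) \ge (1-\eta)\lambda\|P_T^\perp\beta\|_*$.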

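We further consider a target $\beta_*$ that satisfies (19), which we can rewrite as

$$2\sum_{i=1}^n x_i(\langle x_i,\bar\beta-\beta_*\rangle - \epsilon_i) = \tilde a + \tilde b,$$

where $\tilde b \in T^\perp$, and $\|\tilde b\|_{sp} = \tilde\eta\lambda$. We assume that $\|\tilde a\|_2$ is small.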

For matrix completion, we assume that $\{x_i\}$ are matrices of the form $e_{a,b}$ with 1 at entry $(a,b)$ and 0 elsewhere, where $(a,b)$ is chosen uniformly at random. It can be shown using techniques of [22] that under appropriate incoherence conditions, a tangent space dual certificate can be constructed with large probability that satisfies (21). Due to the space limitation, we skip the details. This leads to

$$D_L(\beta_*,\hat\beta) + (1-\eta)\lambda\|P_T^\perp\hat\beta\|_* \le D_L(\beta_*,\bar\beta) + \delta, \qquad \delta = 0.25\langle\lambda UV^\top+\tilde a,\ H_T^{-1}(\lambda UV^\top+\tilde a)\rangle.$$

Note that for sufficiently large $n$, the smallest eigenvalue of $H_T$ can be lower bounded at the order of $n/(pq)$, so that $H_T^{-1}$ contributes a factor of order $pq/n$. Since $\langle\lambda UV^\top,\lambda UV^\top\rangle = \lambda^2 r$, we may generally choose $\lambda$ such that $\langle\tilde a,\tilde a\rangle = O(\lambda^2 r)$; we thus obtain the following oracle inequality for matrix completion:

$$D_L(\beta_*,\hat\beta) + (1-\eta)\lambda\|P_T^\perp\hat\beta\|_* \le D_L(\beta_*,\bar\beta) + O(\lambda^2 pqr/n).$$

If $\epsilon_i$ are iid Gaussian noise $N(0,\sigma^2)$, then we may choose $\lambda$ at the order $\sigma\sqrt{n\ln\max(p,q)/\min(p,q)}$. This gives

$$D_L(\beta_*,\hat\beta) + (1-\eta)\lambda\|P_T^\perp\hat\beta\|_* \le D_L(\beta_*,\bar\beta) + O(\sigma^2\max(p,q)\,r\ln(p+q)).$$
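
The last bound follows from the preceding one by direct substitution. As a short worked check (a sketch of the arithmetic only, using $pq/\min(p,q) = \max(p,q)$):

```latex
\[
\frac{\lambda^2 p q r}{n}
  \;\asymp\; \sigma^2\,\frac{n \ln\max(p,q)}{\min(p,q)}\cdot\frac{p q r}{n}
  \;=\; \sigma^2 \max(p,q)\, r \ln\max(p,q)
  \;\le\; \sigma^2 \max(p,q)\, r \ln(p+q).
\]
```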

In the noise-free case, we can let $\lambda \to 0$, and exact recovery is obtained. This complements a related result of [16] that does not lead to exact recovery even when $\sigma = 0$. In the noisy case, parameter estimation bounds can be obtained in a manner analogous to the parameter estimation bound for group $\ell_1$ regularization. Due to the space limitation, we will leave the details to a dedicated report.

6.3 Mixed norm regularization 

The purpose of this example is to show that the dual certificate analysis can be applied to more complex regularizers that may be difficult to analyze using traditional ideas such as the RIP analysis. The analysis is similar to that of group $\ell_1$ regularization but with more complex calculations. For simplicity, we will only provide a sketch of the analysis while skipping some of the details.
We still consider the regression problem

$$y = X\beta_* + \epsilon,$$

where for simplicity we only consider Gaussian noise $\epsilon \sim N(0,\sigma^2 I_{n\times n})$. We assume that $p = qm$, and the variables $\{1,\ldots,p\}$ are divided into $q$ non-overlapping blocks $\Gamma_1,\ldots,\Gamma_q \subset \{1,\ldots,p\}$, each block of size $m$.

The standard sparse regularization methods use either the Lasso regularizer of (2) or the group-Lasso regularizer of (24). Let $S_1 = \mathrm{supp}(\bar\beta)$ and $S_\Gamma = \mathrm{supp}_\Gamma(\bar\beta)$. We know that under suitable restricted strong convexity conditions, the following oracle inequality holds for the Lasso regularizer (2):

$$\|X(\beta_*-\hat\beta)\|_2^2 + (1-\eta)\lambda\|\hat\beta_{S_1^c}\|_1 \le \|X(\beta_*-\bar\beta)\|_2^2 + O(\sigma^2 n|S_1|\ln p/\gamma_{S_1}),$$

and the following oracle inequality holds for the group Lasso regularizer (24):

$$\|X(\beta_*-\hat\beta)\|_2^2 + (1-\eta)\lambda\|\hat\beta_{S_\Gamma^c}\|_{\Gamma,1} \le \|X(\beta_*-\bar\beta)\|_2^2 + O(\sigma^2 n|S_\Gamma|(m+\ln q)/\gamma_{S_\Gamma}).$$

Note that we always have $|S_1| \le |S_\Gamma|m$. By comparing the above two oracle inequalities, we can see that group sparsity is beneficial when $|S_1| \approx |S_\Gamma|m$, which means that the sparsity pattern occurs in groups and the group structure is correct. In such a case, the dimension dependency reduces from $|S_1|\ln p$ to $|S_\Gamma|\ln q \approx m^{-1}|S_1|\ln q$. However, if some of the signals do not occur in groups, then it is possible that $|S_\Gamma|m$ is much larger than $|S_1|$, and in such a case, Lasso is superior to group Lasso.

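It is natural to ask whether it is possible to combine the benefits of Lasso and group Lasso regularizers. Assume that $\bar\beta$ is decomposed into two parts $\bar\beta = \bar\beta' + \bar\beta''$ so that $\bar\beta''$ covers nonzeros of $\bar\beta$ that occur in groups, and $\bar\beta'$ covers nonzeros of $\bar\beta$ that do not occur in groups. Ideally we would like to achieve an oracle inequality of

$$\begin{aligned}
\|X(\beta_*-\hat\beta)\|_2^2 + (1-\eta)|||\hat\beta||| \qquad (25)\\
\le \|X(\beta_*-\bar\beta)\|_2^2 &+ O\Big(\gamma^{-1}n\sigma^2\big(|\mathrm{supp}(\bar\beta')|\ln p + |\mathrm{supp}_\Gamma(\bar\beta'')|(m+\ln q)\big)\Big)\\
= \|X(\beta_*-\bar\beta)\|_2^2 &+ O\Big(\gamma^{-1}n\sigma^2\big(|\mathrm{supp}(\bar\beta)\setminus\cup_{j\in\tilde S}\Gamma_j|\ln p + |\tilde S|(m+\ln q)\big)\Big),
\end{aligned}$$

where $|||\hat\beta|||$ is a certain seminorm of $\hat\beta$, and $\tilde S = \{j : (m+\ln q) < c\,|\mathrm{supp}(\bar\beta_{\Gamma_j})|\ln p\}$ for some constant $c > 0$. We note that the optimal decomposition can be achieved by taking $\bar\beta'_{\Gamma_j} = 0$ with $\bar\beta''_{\Gamma_j} = \bar\beta_{\Gamma_j}$ when $j \in \tilde S$, and $\bar\beta''_{\Gamma_j} = 0$ with $\bar\beta'_{\Gamma_j} = \bar\beta_{\Gamma_j}$ otherwise.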
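
The decomposition rule just described is simple enough to write down directly. The sketch below assigns each block either entirely to the group part or entirely to the single-coordinate part according to the threshold $m + \ln q < c\,|\mathrm{supp}(\bar\beta_{\Gamma_j})|\ln p$. It assumes contiguous blocks of size $m$ and a user-supplied constant `c`; these choices and the function name are illustrative.

```python
import math

def oracle_decomposition(beta_bar, m, c=1.0):
    # Split beta_bar into (beta1, beta2): beta2 holds blocks treated as groups,
    # beta1 holds the remaining coordinates. Blocks are assumed contiguous.
    p = len(beta_bar)
    q = p // m
    beta1 = [0.0] * p
    beta2 = [0.0] * p
    group_set = []
    for j in range(q):
        block = range(j * m, (j + 1) * m)
        nnz = sum(1 for i in block if beta_bar[i] != 0)
        if m + math.log(q) < c * nnz * math.log(p):
            group_set.append(j)
            for i in block:
                beta2[i] = beta_bar[i]
        else:
            for i in block:
                beta1[i] = beta_bar[i]
    return beta1, beta2, group_set
```

Dense blocks end up in the group part, while blocks with only a few nonzeros stay in the single-coordinate part.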

In the following, we show that the oracle inequality of (25) can be achieved via a mixed norm regularizer defined below:

$$R(\beta) = \inf_{\beta=\beta'+\beta''}\big[\lambda_1\|\beta'\|_1 + \lambda_\Gamma\|\beta''\|_{\Gamma,1}\big]. \qquad (26)$$

This mixed regularizer can be referred to as the infimal convolution of Lasso and group Lasso regularizers, and it is a special case of [14]. If we can prove an oracle inequality of (25) for this regularizer, then it means that we can adaptively decompose the signal $\bar\beta$ into two parts $\bar\beta'$ and $\bar\beta''$ in order to achieve the most significant benefits with the standard sparsity bound for $\bar\beta'$ and the group sparsity bound for $\bar\beta''$ (without knowing the decomposition a priori).
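
The infimal convolution in (26) separates across groups, and on each group it can be evaluated through its dual form $\sup\{\langle u,b\rangle : \|u\|_\infty \le \lambda_1,\ \|u\|_2 \le \lambda_\Gamma\}$. The sketch below uses this route: the maximizer has coordinates $\min(\lambda_1,\tau|b_i|)$, with $\tau$ found by bisection so that the 2-norm constraint is tight. This is one possible way to compute the regularizer, not a procedure taken from the paper; the function names, the contiguous-block assumption and the iteration count are illustrative, and both parameters are assumed positive.

```python
import math

def mixed_norm_group(b, lam1, lam_g, iters=200):
    # Value of inf_c lam1*||b - c||_1 + lam_g*||c||_2 on one group, via the dual
    # sup{<u, b> : ||u||_inf <= lam1, ||u||_2 <= lam_g}.
    a = [abs(x) for x in b if x != 0]
    if not a:
        return 0.0
    if lam1 * math.sqrt(len(a)) <= lam_g:
        return lam1 * sum(a)  # the 2-norm constraint is inactive
    lo, hi = 0.0, lam1 / min(a)  # at hi every coordinate is clipped at lam1
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        nrm = math.sqrt(sum(min(lam1, tau * ai) ** 2 for ai in a))
        if nrm < lam_g:
            lo = tau
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    return sum(ai * min(lam1, tau * ai) for ai in a)

def mixed_norm(beta, m, lam1, lam_g):
    # Sum of the per-group values over contiguous blocks of size m.
    return sum(mixed_norm_group(beta[j:j + m], lam1, lam_g)
               for j in range(0, len(beta), m))
```

By construction the value never exceeds either $\lambda_1\|\beta\|_1$ or $\lambda_\Gamma\|\beta\|_{\Gamma,1}$, which reflects the adaptive choice between the two penalties.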

We will consider the decomposed parametrization $[\beta',\beta'']$, and the mixed norm regularizer (26) becomes a special case of (13). Although the loss function $L(\cdot)$ is not strongly convex with respect to this parametrization, this does not cause problems because we are only interested in $\beta = \beta'+\beta''$. Since $L(\cdot)$ is strongly convex with respect to $\beta$ with an appropriate tangent space $T$, we only need to consider the direction along $\beta = \beta'+\beta''$ when applying the results. In this regard, it is easy to verify that at the optimal decomposition in (26), there exist $u' \in \partial\|\beta'\|_1$ and $u'' \in \partial\|\beta''\|_{\Gamma,1}$ such that $\lambda_1 u' = \lambda_\Gamma u''$. Moreover, for any such $(u',u'')$, $\lambda_1 u' \in \partial R(\beta)$.

In order to define $T$, we first define the group index set $\bar S_\Gamma = \{j : \lambda_\Gamma < 2\lambda_1\|\mathrm{sgn}(\bar\beta)_{\Gamma_j}\|_2\}$, with the corresponding support $\bar S = \cup_{j\in\bar S_\Gamma}\Gamma_j$. The meaning of $\bar S_\Gamma$ is that groups in $\bar S_\Gamma$ are allowed to use both standard and group sparsity to represent $\bar\beta$, while groups in $\bar S_\Gamma^c$ always use standard sparsity only. The set $\bar S$ will expand the tangent space for the nonzero group sparsity elements. We also define the tangent space support set for single sparsity elements as $S_1 = \mathrm{supp}(\bar\beta)\cup\bar S$. Let

[^,/3"] = arg min [XiMi + M\P"\\r,i] ■ (27) 

(/3',/3"):/3=/3'+/3" 

It satisfies AiVp'||i = A r V||/3"|| r ,i, and VR(fi) = AiVpfJi + A r V||^ ||r,l- Consider Tj such 
that / 0, we obtain from AiVp'||i = A r Vp"||r,l that [Vpf.||l]i + onl y when h 7^ for 
i G r j; therefore ||(Vpf. ||i)|| 2 < ||sgn(^) rj || 2 , and thus A r < Ai||(Vpf. ||i)|| 2 < Ai||sgn(^) r . || 2 . It 
implies that j G St and thus supp(/3") C St- Now we can define W and 38 as 

e S (W) = AiVp'Hi = ArVH^'Hr,! 

where we can take [V||p ||i]j =0 when j ^ Si; and define 

e S c(38) = {us? : ||«sj ||oo < Ai & ||its°||r,t» < 0.5A r }. 
With the above choices, we have for all $u \in e_{S^c}(\mathcal{B})$, $e_S(W)+u \in \partial R(\bar\beta)$, because it can be readily checked that $e_S(W)+u \in \partial(\lambda_1\|\bar\beta'\|_1)\cap\partial(\lambda_\Gamma\|\bar\beta''\|_{\Gamma,1})$. Moreover, we have

sup (u, (3) = min 

We can thus define G according to ([20]) as 

G = {e s {W) + rju:u€ e S c(&)}, 

so that G C &R(/3). 

For simplicity, we assume that $\beta_*$ satisfies (19) with $\tilde a = 0$, which can be achieved with the construction in (12). With these choices, we obtain from (22) the following oracle inequality (under an appropriate restricted eigenvalue condition with parameter $\gamma$):

\\X0-^)\\l + (l- V ) . min 

<||JT(^ - A)||3 + T-^dlcsCWOHl) 
<\\X(fi - flOIll + r X (A?|supp(^) \ S r \ + A^|Sr|) • 

The last inequality follows from 

\\es(W)\\l < E ^rll(Vp"||r,i)r,||i + £ ^IKVp'llx)^ III, 

which is a consequence of es(VF) = AiV||/3'||i = ArV||/3"||r i- 

Similar to the standard Lasso and group Lasso cases, for mixed norm regularization, we may still choose the Lasso regularization parameter $\lambda_1 = c_1\sigma\sqrt{n\ln(p)}$, and the group Lasso regularization parameter $\lambda_\Gamma = c_2\sigma\sqrt{n(m+\ln(q))}$ so that (19) holds ($c_1, c_2 > 0$ are constants). Plugging in these values, we obtain the following oracle inequality with this choice of parameters:

p-G9-&)||! + (l-»7). min 

1 °J a p 

<\\X{p - /3*)l|l + 7 -1 ™r 2 • O ((n + lnp)|supp(/9) \ S r | + |Sr|(m + In?)) . 

Since the definition of $\bar S_\Gamma$ is such that $j \in \bar S_\Gamma$ when $m+\ln(q) < 4(c_1/c_2)^2|\mathrm{supp}(\bar\beta_{\Gamma_j})|\ln(p)$, the right-hand side achieves the optimal decomposition error bound in (25). This means that the mixed norm regularizer (26) achieves optimal adaptive decomposition of standard and group sparsity.

6.4 Generalized linear models 

Results for generalized linear models can be easily obtained under the general framework of this paper, as discussed after Corollary 3.1. This section presents a more elaborate treatment. In generalized linear models, we may write the negative log likelihood as

$$L(\beta) = \sum_{i=1}^n \ell_i(\langle x_i,\beta\rangle), \qquad (28)$$



Ai||^s f ||i + 0.5Ar||^|| ri i 



Ai||/3s ? lli + 0.5A r ||^s f ,llr,i 



Ai ||/3^ ||i +0.5Ar||^c||r,i 
where $x_i \in \Omega^*$ and $\ell_i$ may depend on a certain response variable $y_i$. Suppose $\ell_i(t)$ are convex and twice differentiable. Let

$$\kappa = \max_{i\le n}\sup_{s\ne t}\,|\log(\ell_i''(t)) - \log(\ell_i''(s))|/|t-s|$$

be the maximum Lipschitz norm of $\log(\ell_i''(t))$. We note that $\kappa = 1$ for logistic regression with $\ell_i(t) = \ln(1+e^{-t})$, $\kappa = 1$ for the Poisson/log linear regression with $\ell_i(t) = e^t - y_i t$, and $\kappa = 0$ for linear regression. For sparse $\bar\beta$, $C \subset \Omega$, norm $\|\cdot\|$, and $j = 1, 2$, define

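$$\gamma_j(\bar\beta; r, C, \|\cdot\|) = \inf\left\{\sum_{i=1}^n \frac{\ell_i''(\langle x_i,\bar\beta\rangle)}{2e}\min\left(\frac{\langle x_i,\beta-\bar\beta\rangle^2}{\|\beta-\bar\beta\|^2},\ \frac{|\langle x_i,\beta-\bar\beta\rangle|^{2-j}}{r^j\,\|\beta-\bar\beta\|^{2-j}}\right) : \beta \in C\right\}.$$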

The following lemma can be used to bound $D_L(\beta,\bar\beta)$ and $D_L(\bar\beta,\beta)$ from below.
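
The values of $\kappa$ quoted above for logistic and Poisson regression can be checked numerically. The sketch below estimates the Lipschitz constant of $\log\ell''$ by finite differences on a grid; the grid range, step count and function names are illustrative assumptions. For the logistic loss, $\ell''(t) = e^{-t}/(1+e^{-t})^2$ is even in $t$, which is used to evaluate its logarithm stably.

```python
import math

def log_d2_logistic(t):
    # log l''(t) for l(t) = ln(1 + e^{-t}); l'' is even, so use |t|.
    a = abs(t)
    return -a - 2.0 * math.log1p(math.exp(-a))

def log_d2_poisson(t):
    # log l''(t) for l(t) = e^t - y*t; l''(t) = e^t regardless of y.
    return t

def lipschitz_estimate(g, lo=-20.0, hi=20.0, steps=4000):
    # Largest absolute finite-difference slope of g on a uniform grid.
    h = (hi - lo) / steps
    return max(abs(g(lo + (i + 1) * h) - g(lo + i * h)) / h for i in range(steps))
```

Both estimates come out at (or just below) one, consistent with $\kappa = 1$ in these two models.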

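Lemma 6.1 Given $\bar\beta$, $C \subset \Omega$ and norm $\|\cdot\|$, let $\beta \in \Omega$ be such that the ray from $\bar\beta$ to $\beta$ and beyond intersects with $C$: $\{t(\beta-\bar\beta)+\bar\beta : t > 0\}\cap C \ne \emptyset$. If $0 < \|\beta-\bar\beta\| \le r$, then

$$\frac{D_L(\beta,\bar\beta)}{\|\beta-\bar\beta\|^2} \ge \gamma_1(\bar\beta;\kappa r,C,\|\cdot\|), \qquad \frac{D_L(\bar\beta,\beta)}{\|\beta-\bar\beta\|^2} \ge \gamma_2(\bar\beta;\kappa r,C,\|\cdot\|).$$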

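Proof Let $\tilde\beta = t_0(\beta-\bar\beta)+\bar\beta \in C$. Since $\gamma_j(\bar\beta;r,C,\|\cdot\|)$ is decreasing in $r$, it suffices to consider $0 < \|\beta-\bar\beta\| = r$. Since $\kappa$ is the Lipschitz norm of $\log(\ell_i''(t))$,

$$\begin{aligned}
D_L(\beta,\bar\beta)/r^2 &= \int_0^1 (1-t)\sum_{i=1}^n \ell_i''(\langle x_i,\bar\beta\rangle + t\langle x_i,\beta-\bar\beta\rangle)\langle x_i,\beta-\bar\beta\rangle^2\,dt/r^2 \\
&\ge \int_0^1 (1-t)\sum_{i=1}^n \ell_i''(\langle x_i,\bar\beta\rangle)\,e^{-t\kappa|\langle x_i,\beta-\bar\beta\rangle|}\langle x_i,\beta-\bar\beta\rangle^2\,dt/r^2 \\
&\ge \sum_{i=1}^n \ell_i''(\langle x_i,\bar\beta\rangle)\langle x_i,\beta-\bar\beta\rangle^2\int_0^1 (1-t)\,I\{t\kappa|\langle x_i,\beta-\bar\beta\rangle| \le 1\}\,dt/(er^2).
\end{aligned}$$

Since $\int_0^{x\wedge 1}(1-t)\,dt = x\wedge 1 - (x\wedge 1)^2/2 \ge (x\wedge 1)/2$, we find

$$\begin{aligned}
D_L(\beta,\bar\beta)/r^2 &\ge \sum_{i=1}^n \ell_i''(\langle x_i,\bar\beta\rangle)\langle x_i,\beta-\bar\beta\rangle^2\min\Big(1,\frac{1}{\kappa|\langle x_i,\beta-\bar\beta\rangle|}\Big)\Big/(2er^2) \\
&= \sum_{i=1}^n \frac{\ell_i''(\langle x_i,\bar\beta\rangle)}{2e}\min\left(\frac{\langle x_i,\beta-\bar\beta\rangle^2}{\|\beta-\bar\beta\|^2},\ \frac{|\langle x_i,\beta-\bar\beta\rangle|}{\kappa r\|\beta-\bar\beta\|}\right).
\end{aligned}$$

Since $|\langle x_i,\beta-\bar\beta\rangle|/\|\beta-\bar\beta\| = |\langle x_i,\tilde\beta-\bar\beta\rangle|/\|\tilde\beta-\bar\beta\|$ and $\tilde\beta \in C$, $D_L(\beta,\bar\beta)/r^2 \ge \gamma_1(\bar\beta;\kappa r,C,\|\cdot\|)$. The proof for $D_L(\bar\beta,\beta)$ is similar. We have

$$\begin{aligned}
D_L(\bar\beta,\beta)/r^2 &= \int_0^1 \sum_{i=1}^n \ell_i''(\langle x_i,\bar\beta\rangle + t\langle x_i,\beta-\bar\beta\rangle)\langle x_i,\beta-\bar\beta\rangle^2\,t\,dt/r^2 \\
&\ge \sum_{i=1}^n \ell_i''(\langle x_i,\bar\beta\rangle)\langle x_i,\beta-\bar\beta\rangle^2\int_0^1 I\{t\kappa|\langle x_i,\beta-\bar\beta\rangle| \le 1\}\,t\,dt/(er^2).
\end{aligned}$$

This gives $D_L(\bar\beta,\beta)/r^2 \ge \gamma_2(\bar\beta;\kappa r,C,\|\cdot\|)$. ■

Suppose $\gamma_j(\bar\beta;r_0,C,\|\cdot\|) \ge \gamma_0$ for $j = 1, 2$. Lemma 6.1 asserts that for the $\beta$ considered, both $D_L(\beta,\bar\beta)$ and $D_L(\bar\beta,\beta)$ are no smaller than $\gamma_0\|\beta-\bar\beta\|^2$ for $\kappa\|\beta-\bar\beta\| \le r_0$. For larger $r = \|\beta-\bar\beta\|$,

$$D_L(\beta,\bar\beta) \ge r^2\gamma_1(\bar\beta;\kappa r,C,\|\cdot\|) \ge \|\beta-\bar\beta\|\,(r_0/\kappa)\,\gamma_1(\bar\beta;r_0,C,\|\cdot\|),$$
$$D_L(\bar\beta,\beta) \ge r^2\gamma_2(\bar\beta;\kappa r,C,\|\cdot\|) \ge (r_0/\kappa)^2\,\gamma_2(\bar\beta;r_0,C,\|\cdot\|).$$

Since $D_L(\beta,\bar\beta)$ is convex in $\beta$ and $D_L(\bar\beta,\beta)$ is not, such lower bounds are of the best possible type for large $\|\beta-\bar\beta\|$ when $\ell_i''(t)$ are small for large $t$, as in the case of logistic regression.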
Given /3, setting Cq = {P : sup ug <3 (u + VL(f3), f3) < 0} yields the lower bound 

7L(/S;r,G,||-||)>7i(/3;^,C G ,||-||) 

for the RSC constant in Definition 14.11 The lower bound Dl((3,(3) > (ro/ft) 2 72(/3; vq,C, \\ • ||) can 



be used to check the condition Dl(/3,/3) > Di(/3,f3) in Corollaries 13.11 and 14.21 
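We measure the noise level by

$$\eta(\beta_*) = \sup\{|\langle\nabla L(\beta_*),\beta\rangle|/R(\beta) : \beta \ne 0,\ \beta \in \Omega\}.$$

Let $\bar\beta$ be a sparse vector and $G \subset \partial R(\bar\beta)$. Given $\{\Omega,\|\cdot\|\}$, we measure the penalty level by

$$\Lambda(\bar\beta,\beta_*;\|\cdot\|) = \sup\{\langle\nabla L(\beta_*)+u,\ \bar\beta-\beta\rangle/\|\bar\beta-\beta\| : u \in \partial R(\beta),\ \beta \in \Omega\}.$$

Since for all $\bar u \in \partial R(\bar\beta)$ and $u \in \partial R(\beta)$, we have $\langle u,\bar\beta-\beta\rangle \le \langle\bar u,\bar\beta-\beta\rangle$, it follows that

$$\Lambda(\bar\beta,\beta_*;\|\cdot\|) \le \inf_{\bar u\in\partial R(\bar\beta)}\|\nabla L(\beta_*)+\bar u\|_D,$$

where $\|\cdot\|_D$ is the dual norm of $\|\cdot\|$. This connects the quantity $\Lambda(\cdot)$ to $\inf_{u\in G}\|u+\nabla L(\bar\beta)\|_D$ used in Theorem 4.1.

Similarly, we may define

$$C_{\bar\beta,\beta_*} = \Big\{\beta : \sup_{u\in\partial R(\beta)}\langle u+\nabla L(\beta_*),\ \bar\beta-\beta\rangle \ge 0\Big\}.$$

Note that we have $C_{\bar\beta,\beta_*} \subset \{\beta : \inf_{\bar u\in\partial R(\bar\beta)}\langle\bar u+\nabla L(\beta_*),\ \bar\beta-\beta\rangle \ge 0\}$, and this relationship connects the quantity $\gamma_2(\bar\beta;r,C_{\bar\beta,\beta_*},\|\cdot\|)$ in Theorem 6.1 to the quantity $\gamma_L(\bar\beta;r,G,\|\cdot\|)$ in Definition 4.1. The following result for generalized linear models is related to Theorem 4.1 but is more specific to the loss function (28) and more elaborate.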



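Theorem 6.1 Suppose $\eta(\beta_*) < 1$. Let $\bar\beta$ be a sparse vector such that

$$\sup_{\beta\in C_{\bar\beta,\beta_*}}\|\beta-\bar\beta\| \le \frac{\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)}{\kappa^2\Lambda(\bar\beta,\beta_*;\|\cdot\|)} + \frac{\Lambda(\bar\beta,\beta_*;\|\cdot\|)}{4\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)}. \qquad (29)$$

Then,

$$D_L(\hat\beta,\beta_*) \le D_L(\bar\beta,\beta_*) + \frac{\Lambda^2(\bar\beta,\beta_*;\|\cdot\|)}{4\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)}.$$

Proof. Let $\tilde\beta_t = \bar\beta + t(\hat\beta-\bar\beta)$. Define

$$f(t) = D_L(\tilde\beta_t,\beta_*) - D_L(\bar\beta,\beta_*) = L(\tilde\beta_t) - L(\bar\beta) + t\langle-\nabla L(\beta_*),\hat\beta-\bar\beta\rangle.$$

The function $f(t)$ is convex with $f(0) = 0$ and $f'(t) = \langle\nabla L(\tilde\beta_t)-\nabla L(\beta_*),\hat\beta-\bar\beta\rangle$. If $f'(1) \le 0$, then $D_L(\hat\beta,\beta_*) - D_L(\bar\beta,\beta_*) = f(1) \le f(0) = 0$ and the conclusion holds. Assume $f'(1) > 0$ in the sequel.

Let $u = -\nabla L(\hat\beta)$. By the first-order optimality condition for $\hat\beta$, $u \in \partial R(\hat\beta)$. Since $f'(1) = \langle u+\nabla L(\beta_*),\bar\beta-\hat\beta\rangle \ge 0$, $\hat\beta \in C_{\bar\beta,\beta_*}$. It follows that $f'(1) \le \Lambda(\bar\beta,\beta_*;\|\cdot\|)\|\hat\beta-\bar\beta\|$. By Lemma 6.1,

$$f(t) - f'(t)t = -D_L(\bar\beta,\tilde\beta_t) \le -\|\tilde\beta_t-\bar\beta\|^2\,\gamma_2(\bar\beta;\kappa\|\tilde\beta_t-\bar\beta\|,C_{\bar\beta,\beta_*},\|\cdot\|).$$

Consider two cases. If $\kappa\|\hat\beta-\bar\beta\| \le 1$, we set $t = 1$ to obtain

$$\begin{aligned}
f(1) &\le f'(1) - \|\hat\beta-\bar\beta\|^2\,\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|) \\
&\le \Lambda(\bar\beta,\beta_*;\|\cdot\|)\|\hat\beta-\bar\beta\| - \|\hat\beta-\bar\beta\|^2\,\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|).
\end{aligned}$$

Taking the maximum of $x\Lambda(\bar\beta,\beta_*;\|\cdot\|) - x^2\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)$ over $x$, we find

$$D_L(\hat\beta,\beta_*) - D_L(\bar\beta,\beta_*) = f(1) \le \frac{\Lambda^2(\bar\beta,\beta_*;\|\cdot\|)}{4\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)}.$$

For $\kappa\|\hat\beta-\bar\beta\| > 1$, we set $t < 1$ so that $\kappa\|\tilde\beta_t-\bar\beta\| = 1$. By convexity of $f$,

$$f(1) \le f'(1) + f(t) - tf'(t) \le \Lambda(\bar\beta,\beta_*;\|\cdot\|)\|\hat\beta-\bar\beta\| - \kappa^{-2}\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|).$$

This gives $f(1) \le \Lambda^2(\bar\beta,\beta_*;\|\cdot\|)/\{4\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)\}$ when

$$\|\hat\beta-\bar\beta\| \le \frac{\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)}{\kappa^2\Lambda(\bar\beta,\beta_*;\|\cdot\|)} + \frac{\Lambda(\bar\beta,\beta_*;\|\cdot\|)}{4\gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)}.$$

The proof is complete in view of the assumed condition on $\bar\beta$. ■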

Condition (29) holds if $\sup_{\beta\in\Omega}\|\beta\| \le A$ and $2A \le \gamma_2(\bar\beta;1,C_{\bar\beta,\beta_*},\|\cdot\|)/\{\kappa^2\Lambda(\bar\beta,\beta_*;\|\cdot\|)\}$. This is a weaker condition than the condition discussed after Corollary 3.1, because the quantity $\Lambda(\bar\beta,\beta_*;\|\cdot\|) \le \inf_{u\in\partial R(\bar\beta)}\|\nabla L(\beta_*)+u\|_D$ is generally very small, which means that we allow a very large $A$. Under this relatively weak condition, Theorem 6.1 gives an oracle inequality for generalized linear models that can be easily applied to common formulations such as logistic regression and Poisson regression.